the number of the S1 image responses and increase tolerance to stimulus translation and scaling. Then, pooling over a local neighborhood using a grid of size $n \times n$ is performed. From band 1 to band 8, the value of $n$ increases from 8 to 22 in steps of two pixels. Furthermore, a subsampling operation can also be performed by overlapping the receptive fields of the C1 units by a certain amount $\Delta_s$ ($\Delta_s = 4$ for band 1, $5$ for band 2, $\ldots$, $11$ for band 8), given by the value of the parameter C1Overlap. The value C1Overlap = 2 is mostly used, meaning that half of the S1 units feeding into a C1 unit are also used as input to the adjacent C1 unit in each direction. Higher values of C1Overlap indicate a greater degree of overlap. This layer has a computational complexity of $O(N^2M)$.
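As an illustration, the following is a minimal Python/NumPy sketch of this C1 max-pooling with overlapping receptive fields; the function name and the per-orientation 2-D array layout are assumptions for illustration, not the original model code:

```python
import numpy as np

def c1_pool(s1_response, n, c1_overlap=2):
    """Max-pool an S1 response map over local n x n neighborhoods.

    With c1_overlap = 2 the window shifts by n // 2, so half of the
    S1 units feeding a C1 unit are reused by the adjacent C1 unit.
    """
    step = max(1, n // c1_overlap)          # subsampling stride
    h, w = s1_response.shape
    rows = range(0, h - n + 1, step)
    cols = range(0, w - n + 1, step)
    c1 = np.empty((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            c1[i, j] = s1_response[r:r + n, c:c + n].max()
    return c1

# Example: band 1 uses an 8 x 8 pooling grid.
s1 = np.random.rand(128, 128)
c1_band1 = c1_pool(s1, n=8)
```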
S2 Layer: In the original version of HMAX, the standard model, the connectivity from C1 to S2 was hard-coded to generate several combinations of C1 inputs. This model was not able to capture discriminating features that distinguish facial images from natural images. To improve on this, an extended version, called HMAX with feature learning, was proposed (Serre et al., 2005b). In this model,
each S2 unit acts as a Radial Basis Function (RBF)
unit, which serves to compute a function of the dis-
tance between the input and each of the stored proto-
types learned during the feature learning stage. That
is, for an image patch X from the previous C1 layer at
a particular scale, the S2 response (image response) is
given by:
$$S2_{out} = \exp\left(-\beta \, \|X - P_i\|^2\right), \qquad (2)$$
where $\beta$ represents the sharpness of the tuning, $P_i$ is the $i$th prototype, and $\|\cdot\|$ represents the Euclidean distance. This layer has a computational complexity of $O(PN^2M^2)$, where $P$ is the number of prototypes.
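A minimal sketch of this RBF tuning in Python/NumPy; the function name and the flattened-patch representation are illustrative assumptions:

```python
import numpy as np

def s2_response(patch, prototypes, beta=1.0):
    """RBF tuning of Eq. (2): exp(-beta * ||X - P_i||^2).

    patch      -- C1 patch X, flattened to a 1-D vector
    prototypes -- array of P stored prototypes, one per row
    Returns one S2 response per prototype.
    """
    diffs = prototypes - patch              # broadcast over the P rows
    sq_dists = np.sum(diffs ** 2, axis=1)   # squared Euclidean distances
    return np.exp(-beta * sq_dists)

# Example: 10 prototypes of a flattened 4x4x4 C1 patch (64 values).
X = np.random.rand(64)
P = np.random.rand(10, 64)
print(s2_response(X, P, beta=0.5))
```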
C2 Layer: This layer provides the final invariance stage by taking the maximum response of the corresponding S2 units over all scales and orientations. The C2 units provide input to the VTUs. This layer has a computational complexity of $O(N^2MP)$.
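Continuing the sketch above, the C2 stage reduces to a global maximum per prototype; storing the S2 responses per band as (P, height, width) arrays is an assumption for illustration:

```python
import numpy as np

def c2_features(s2_maps):
    """Global max over positions and scale bands for each prototype.

    s2_maps -- list over scale bands; each entry has shape
               (P, height, width) of S2 responses.
    Returns a length-P C2 feature vector.
    """
    per_band_max = [m.reshape(m.shape[0], -1).max(axis=1) for m in s2_maps]
    return np.maximum.reduce(per_band_max)
```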
VTU Layer: At runtime, each image in the database is propagated through the four layers described above. The C1 and C2 features are extracted and then passed to a simple linear classifier; typically, support vector machine (SVM) and nearest-neighbor (NN) classifiers are employed.
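For illustration, a minimal classification sketch using scikit-learn; the random arrays stand in for real C1/C2 feature vectors and labels:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# Dummy stand-ins for C2 feature vectors (one row per image) and labels.
rng = np.random.default_rng(0)
c2_train, y_train = rng.random((100, 50)), rng.integers(0, 2, 100)
c2_test, y_test = rng.random((20, 50)), rng.integers(0, 2, 20)

svm = LinearSVC().fit(c2_train, y_train)
nn = KNeighborsClassifier(n_neighbors=1).fit(c2_train, y_train)
print("SVM accuracy:", svm.score(c2_test, y_test))
print("NN accuracy:", nn.score(c2_test, y_test))
```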
The Learning Stage: The learning process aims to randomly select the $P$ prototypes used by the S2 units. They are selected from a random image at the C1 layer by extracting a patch of size $4 \times 4$, $8 \times 8$, $12 \times 12$, or $16 \times 16$ at a random scale and position (bands 1 to 8). An $8 \times 8$ patch, for example, contains $8 \times 8 \times 4 = 256$ C1 unit values rather than 64, since at each position there are units representing each of the four orientations [$0^\circ$, $45^\circ$, $90^\circ$, $135^\circ$].
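A minimal sketch of this prototype sampling, assuming C1 responses are stored as (height, width, 4) arrays per band; the storage layout and function name are assumptions:

```python
import numpy as np

def sample_prototype(c1_bands, rng, sizes=(4, 8, 12, 16)):
    """Extract one S2 prototype: a random patch from a random band.

    c1_bands -- list of C1 response arrays, each (h, w, 4),
                one entry per scale band, 4 orientations deep.
    """
    band = c1_bands[rng.integers(len(c1_bands))]
    n = rng.choice(sizes)
    h, w, _ = band.shape
    r = rng.integers(h - n + 1)
    c = rng.integers(w - n + 1)
    # An n x n patch holds n * n * 4 C1 values (four orientations).
    return band[r:r + n, c:c + n, :].copy()

rng = np.random.default_rng(42)
c1_bands = [np.random.rand(64 - 4 * b, 64 - 4 * b, 4) for b in range(8)]
prototypes = [sample_prototype(c1_bands, rng) for _ in range(10)]
```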
3 S1 LAYER APPROXIMATIONS
At the S1 layer, several approximations are investigated in order to improve the original HMAX model in terms of both accuracy and computational complexity. Each approximation has been evaluated independently using SVM and NN classifiers.
3.1 Combined Image-based HMAX
using 2-D Gabor Filters
In this approximation, unimportant information such as illumination and expression variations is eliminated from the image, and hence its salient features become richer (Sharif et al., 2012). To achieve this, four main steps are applied to the original image $A$ of size $h \times a$:
Step 1 – Adaptive Histogram Equalization: In order
to handle the large intensity values to some extent,
adaptive histogram equalization is applied to the orig-
inal image A:
$$\text{Adapted Image} = \text{AdaptHistEq}(A) \qquad (3)$$
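As a sketch, Step 1 could be implemented with scikit-image's adaptive histogram equalization routine; this is one plausible implementation, not necessarily the one used by the authors:

```python
import numpy as np
from skimage import exposure

# A: grayscale image of size h x a, float values in [0, 1]
A = np.random.rand(112, 92)
adapted_image = exposure.equalize_adapthist(A)
```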
Step 2 – SVD Decomposition: Singular value decomposition (SVD) is applied to the image after equalization. The concept behind SVD is to break the image down into the product of three different matrices:
$$\text{SVD}(\text{Adapted Image}) = L \times D \times R^T, \qquad (4)$$
where $L$ is an orthogonal matrix of size $h \times h$, $R^T$ is the transpose of an orthogonal matrix $R$ of size $a \times a$, and $D$ is a diagonal matrix of size $h \times a$.
This decomposition makes the computations more immune to numerical errors; it also exposes the substructure of the original image more clearly and orders its elements from the greatest amount of variation to the least.
Step 3 – Image Reconstruction: From the values of $L$, $D$, and $R$, the reconstructed image is computed as follows:
$$\text{Reconstructed Image} = L \ast D^{\alpha} \ast R^T, \qquad (5)$$
where $\alpha$ is a magnification factor that varies between 1 and 2. Letting $\alpha$ vary between one and two magnifies the singular values of $D$, which makes the reconstruction invariant to illumination changes.
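Steps 2 and 3 can be sketched together in NumPy as follows; the economy-size SVD and the choice $\alpha = 1.5$ are illustrative assumptions within the stated range:

```python
import numpy as np

def svd_reconstruct(adapted_image, alpha=1.5):
    """Reconstruct the image with magnified singular values, Eq. (5)."""
    L, d, Rt = np.linalg.svd(adapted_image, full_matrices=False)
    # d holds the diagonal of D; raising it to alpha in (1, 2)
    # magnifies the singular values before reconstruction.
    return (L * d ** alpha) @ Rt

reconstructed = svd_reconstruct(np.random.rand(112, 92))
```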