the number of the S1 image responses and increase tolerance to stimulus translation and scaling. Then, pooling over a local neighborhood using a grid of size $n \times n$ is performed. From band 1 to band 8, the value of $n$ increases from 8 to 22 in steps of two pixels. Furthermore, a subsampling operation can also be performed by overlapping the receptive fields of the C1 units by a certain amount $\Delta_s$ ($\Delta_s = 4$ for band 1, $5$ for band 2, $\ldots$, $11$ for band 8), given by the value of the parameter C1Overlap. The value C1Overlap = 2 is mostly used, meaning that half of the S1 units feeding into a C1 unit are also used as input to the adjacent C1 unit in each direction. Higher values of C1Overlap indicate a greater degree of overlap. This layer has a computational complexity of $O(N^2M)$.
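As an illustration, the following is a minimal Python/NumPy sketch of this C1 max-pooling with overlapping receptive fields; the function name and the per-orientation 2-D array layout are assumptions for illustration, not the original model code:

```python
import numpy as np

def c1_pool(s1_response, n, c1_overlap=2):
    """Max-pool an S1 response map over local n x n neighborhoods.

    With c1_overlap = 2 the window shifts by n // 2, so half of the
    S1 units feeding a C1 unit are reused by the adjacent C1 unit.
    """
    step = max(1, n // c1_overlap)          # subsampling stride
    h, w = s1_response.shape
    rows = range(0, h - n + 1, step)
    cols = range(0, w - n + 1, step)
    c1 = np.empty((len(rows), len(cols)))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            c1[i, j] = s1_response[r:r + n, c:c + n].max()
    return c1

# Example: band 1 uses an 8 x 8 pooling grid.
s1 = np.random.rand(128, 128)
c1_band1 = c1_pool(s1, n=8)
```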
S2 Layer: In the original version of HMAX, the standard model, the connectivity from C1 to S2 was hard-coded to generate several combinations of C1 inputs. This model was not able to capture discriminating features that distinguish facial images from natural images. To improve on this, an extended version, called HMAX with feature learning, was proposed (Serre et al., 2005b). In this model,
each S2 unit acts as a Radial Basis Function (RBF)
unit, which serves to compute a function of the dis-
tance between the input and each of the stored proto-
types learned during the feature learning stage. That
is, for an image patch X from the previous C1 layer at
a particular scale, the S2 response (image response) is
given by:
$$S2_{out} = \exp\left(-\beta \, \|X - P_i\|^2\right), \qquad (2)$$
where $\beta$ represents the sharpness of the tuning, $P_i$ is the $i$th prototype, and $\|\cdot\|$ represents the Euclidean distance. This layer has a computational complexity of $O(PN^2M^2)$, where $P$ is the number of prototypes.
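A minimal sketch of this RBF tuning in Python/NumPy; the function name and the flattened-patch representation are illustrative assumptions:

```python
import numpy as np

def s2_response(patch, prototypes, beta=1.0):
    """RBF tuning of Eq. (2): exp(-beta * ||X - P_i||^2).

    patch      -- C1 patch X, flattened to a 1-D vector
    prototypes -- array of P stored prototypes, one per row
    Returns one S2 response per prototype.
    """
    diffs = prototypes - patch              # broadcast over the P rows
    sq_dists = np.sum(diffs ** 2, axis=1)   # squared Euclidean distances
    return np.exp(-beta * sq_dists)

# Example: 10 prototypes of a flattened 4x4x4 C1 patch (64 values).
X = np.random.rand(64)
P = np.random.rand(10, 64)
print(s2_response(X, P, beta=0.5))
```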
C2 Layer: This layer provides the final invariance stage by taking the maximum response of the corresponding S2 units over all scales and orientations. The C2 units provide input to the VTUs. This layer has a computational complexity of $O(N^2MP)$.
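Continuing the sketch above, the C2 stage reduces to a global maximum per prototype; storing the S2 responses per band as (P, height, width) arrays is an assumption for illustration:

```python
import numpy as np

def c2_features(s2_maps):
    """Global max over positions and scale bands for each prototype.

    s2_maps -- list over scale bands; each entry has shape
               (P, height, width) of S2 responses.
    Returns a length-P C2 feature vector.
    """
    per_band_max = [m.reshape(m.shape[0], -1).max(axis=1) for m in s2_maps]
    return np.maximum.reduce(per_band_max)
```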
VTU Layer: At runtime, each image in the database is propagated through the four layers described above. The C1 and C2 features are extracted and then passed to a simple linear classifier; typically, support vector machine (SVM) and nearest-neighbor (NN) classifiers are employed.
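For illustration, a minimal classification sketch using scikit-learn; the random arrays stand in for real C1/C2 feature vectors and labels:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# Dummy stand-ins for C2 feature vectors (one row per image) and labels.
rng = np.random.default_rng(0)
c2_train, y_train = rng.random((100, 50)), rng.integers(0, 2, 100)
c2_test, y_test = rng.random((20, 50)), rng.integers(0, 2, 20)

svm = LinearSVC().fit(c2_train, y_train)
nn = KNeighborsClassifier(n_neighbors=1).fit(c2_train, y_train)
print("SVM accuracy:", svm.score(c2_test, y_test))
print("NN accuracy:", nn.score(c2_test, y_test))
```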
The Learning Stage: The learning process aims to randomly select the $P$ prototypes used by the S2 units. They are selected from a random image at the C1 layer by extracting a patch of size $4 \times 4$, $8 \times 8$, $12 \times 12$, or $16 \times 16$ at a random scale and position (bands 1 to 8). An $8 \times 8$ patch, for example, contains $8 \times 8 \times 4 = 256$ C1 unit values rather than 64, since at each position there are units representing each of the four orientations [$0^\circ$, $45^\circ$, $90^\circ$, $135^\circ$].
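A minimal sketch of this prototype sampling, assuming C1 responses are stored as (height, width, 4) arrays per band; the storage layout and function name are assumptions:

```python
import numpy as np

def sample_prototype(c1_bands, rng, sizes=(4, 8, 12, 16)):
    """Extract one S2 prototype: a random patch from a random band.

    c1_bands -- list of C1 response arrays, each (h, w, 4),
                one entry per scale band, 4 orientations deep.
    """
    band = c1_bands[rng.integers(len(c1_bands))]
    n = rng.choice(sizes)
    h, w, _ = band.shape
    r = rng.integers(h - n + 1)
    c = rng.integers(w - n + 1)
    # An n x n patch holds n * n * 4 C1 values (four orientations).
    return band[r:r + n, c:c + n, :].copy()

rng = np.random.default_rng(42)
c1_bands = [np.random.rand(64 - 4 * b, 64 - 4 * b, 4) for b in range(8)]
prototypes = [sample_prototype(c1_bands, rng) for _ in range(10)]
```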
3 S1 LAYER APPROXIMATIONS
At the S1 layer, several approximations are investigated in order to improve the original HMAX model in terms of both accuracy and computational complexity. Each approximation has been evaluated independently using SVM and NN classifiers.
3.1 Combined Image-based HMAX
using 2-D Gabor Filters
In this approximation, unimportant information such as illumination and expression variations is eliminated from the image, and hence its salient features become richer (Sharif et al., 2012). To achieve this, four main steps are applied to the original image $A$ of size $h \times a$:
Step 1 – Adaptive Histogram Equalization: In order
to handle the large intensity values to some extent,
adaptive histogram equalization is applied to the orig-
inal image A:
$$\text{Adapted Image} = \text{AdaptHistEq}(A) \qquad (3)$$
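As a sketch, Step 1 could be implemented with scikit-image's adaptive histogram equalization routine; this is one plausible implementation, not necessarily the one used by the authors:

```python
import numpy as np
from skimage import exposure

# A: grayscale image of size h x a, float values in [0, 1]
A = np.random.rand(112, 92)
adapted_image = exposure.equalize_adapthist(A)
```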
Step 2 – SVD Decomposition: Singular value decomposition (SVD) is applied to the image after equalization. The concept behind SVD is to break the image down into the product of three different matrices:
$$\text{SVD}(\text{Adapted Image}) = L \times D \times R^T, \qquad (4)$$
where $L$ is an orthogonal matrix of size $h \times h$, $R^T$ is the transpose of an orthogonal matrix $R$ of size $a \times a$, and $D$ is a diagonal matrix of size $h \times a$.
This decomposition makes the computations more immune to numerical errors; it also exposes the substructure of the original image more clearly and orders its elements from the greatest amount of variation to the least.
Step 3 – Image Reconstruction: From the values of $L$, $D$, and $R$, the reconstructed image is computed as follows:
$$\text{Reconstructed Image} = L \ast D^{\alpha} \ast R^T, \qquad (5)$$
where $\alpha$ is a magnification factor that varies between 1 and 2. Letting $\alpha$ vary between one and two magnifies the singular values of $D$, which makes the reconstruction invariant to illumination changes.
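Steps 2 and 3 can be sketched together in NumPy as follows; the economy-size SVD and the choice $\alpha = 1.5$ are illustrative assumptions within the stated range:

```python
import numpy as np

def svd_reconstruct(adapted_image, alpha=1.5):
    """Reconstruct the image with magnified singular values, Eq. (5)."""
    L, d, Rt = np.linalg.svd(adapted_image, full_matrices=False)
    # d holds the diagonal of D; raising it to alpha in (1, 2)
    # magnifies the singular values before reconstruction.
    return (L * d ** alpha) @ Rt

reconstructed = svd_reconstruct(np.random.rand(112, 92))
```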