3.2.1 Pose Similarity Feature Space
We transform the whole problem into a new feature
space, termed the pose similarity feature space
(PSFS). The PSFS is derived by computing
similarities between LESH features coming from
examples of the same pose, and similarities between
features from examples of all the different poses.
As the measure of similarity, we use a modified K-L
divergence, which is numerically stable, symmetric,
and robust with respect to noise and the size of the
histogram bins. Strictly speaking, it measures the
dissimilarity between two histograms, so low
values mean more similar. It is defined as
\[
d(H, K) = \sum_{r} \eta_{r} \sum_{i} \left( h_{i,r} \log \frac{h_{i,r}}{m_{i,r}} + k_{i,r} \log \frac{k_{i,r}}{m_{i,r}} \right) \tag{6}
\]
where the subscript ‘r’ runs over the total number of
regions (partitions) and ‘i’ over the number of bins in
each corresponding local histogram h and k (see
Section 2.2.1); ‘m’ is the mean of the corresponding
bins, $m_{i,r} = (h_{i,r} + k_{i,r})/2$; and ‘η_r’
provides a way to weight each region of the face while
computing similarity scores. This could be used, for
instance, to overcome problems due to expressions by
assigning a lower weight to the regions that are most
affected. In our experiments, for now, this η_r is set to 1.
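For concreteness, a minimal sketch of Eq. (6) in Python/NumPy follows; the (R, B) array layout for the R regional histograms, the function name modified_kl, and the eps guard against log(0) are our assumptions for illustration, not part of the method description:

    import numpy as np

    def modified_kl(H, K, eta=None, eps=1e-12):
        """Modified K-L divergence of Eq. (6): symmetric, numerically stable.

        H, K : (R, B) arrays of R regional histograms with B bins each
               (an assumed storage layout for the LESH descriptor).
        eta  : optional per-region weights eta_r (defaults to 1, as in the text).
        """
        H, K = np.asarray(H, float), np.asarray(K, float)
        eta = np.ones(H.shape[0]) if eta is None else np.asarray(eta, float)
        M = 0.5 * (H + K)  # m_{i,r}: mean of the corresponding bins of H and K
        # eps guards log(0); bins empty in both histograms contribute ~0
        term = H * np.log((H + eps) / (M + eps)) + K * np.log((K + eps) / (M + eps))
        return float(np.sum(eta * term.sum(axis=1)))

Lower return values indicate more similar histograms, consistent with the dissimilarity reading above.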
For each example in our training set, we
compute these similarities, on the derived LESH
features, with the rest of the examples in the same
pose. Concatenating them gives rise to an intra-pose
‘IP’ (same pose) similarity vector. Similarly,
computing these similarities for each example with
all other examples in a different pose gives rise to an
extra-pose ‘EP’ similarity vector. Each example is
thus represented in the PSFS, as a function of its
similarities, by these IP or EP vectors.
Note, however, that the dimensionality of this PSFS
is a direct function of the total number of examples
per pose in the training set. Therefore, to put an upper
limit on the dimensionality of the derived PSFS, and
also to generate many representative IP and EP
vectors for a test face (as explained shortly), we
partition our training set into disjoint subsets
such that each subset has the same number of
subjects in each pose. To make this concrete,
consider our training set of 15 subjects, where each
subject appears under 7 different illumination and
expression imaging conditions in each of the 9 poses
(see Figure 2). We therefore have 15 × 7 = 105
examples per pose.
Deriving a PSFS directly would mean a 105-dimensional
feature space, whereas partitioning it into
disjoint subsets, such that each subset contains all
15 subjects but in a different combination of the
imaging conditions, yields a 15-dimensional
feature space while still representing all the
variations we want to model.
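The text does not prescribe a particular partitioning scheme; as one plausible sketch, a Latin-square-style assignment yields 7 disjoint subsets that each contain all 15 subjects, with every subject appearing under a different imaging condition from subset to subset:

    n_subjects, n_conditions = 15, 7  # the example figures above

    # Subset t takes subject s under imaging condition (s + t) mod 7, so each
    # subject cycles through all 7 conditions exactly once across the 7 subsets.
    subsets = [
        [(s, (s + t) % n_conditions) for s in range(n_subjects)]
        for t in range(n_conditions)
    ]
    assert all(len(sub) == n_subjects for sub in subsets)  # 15 examples per pose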
3.2.2 Formal Description of our Approach
Formally, our approach first partitions the
training set into ‘k’ disjoint subsets (each containing
N training examples per pose). The subsets are
disjoint in terms of the 7 imaging conditions, chosen
such that each subject appears under a different
imaging condition in each subset.
In each subset, we then compute for each
example its similarity, on the derived LESH features,
to the rest of the examples in the same pose. Thus,
for ‘N’ examples per pose, we compute ‘N−1’
similarities for each example; concatenating them
gives rise to an (N−1)-dimensional intra-pose (IP)
similarity feature vector for each of the N examples.
Extra-pose (EP) vectors are obtained similarly, by
computing these similarities between each example
in one pose and the N−1 examples in a different pose,
leaving out the same subject each time.
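A minimal sketch of this construction, reusing the hypothetical modified_kl above and assuming that examples of the same subject occupy the same index in every pose list:

    import numpy as np

    def ip_vector(i, pose_feats):
        # (N-1)-dimensional intra-pose vector: similarities of example i
        # to the other examples of the same pose.
        return np.array([modified_kl(pose_feats[i], f)
                         for j, f in enumerate(pose_feats) if j != i])

    def ep_vector(i, pose_feats, other_pose_feats):
        # (N-1)-dimensional extra-pose vector: similarities of example i to
        # the examples of a different pose, leaving out the same subject
        # (assumed to share index i in both lists).
        return np.array([modified_kl(pose_feats[i], f)
                         for j, f in enumerate(other_pose_feats) if j != i])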
Thus we will have N · P · k IP samples and
N · P · (P − 1) · k EP samples for training, where
‘N’ is the number of examples per pose, ‘P’ is the
total number of poses, and ‘k’ is the number of
subsets.
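As a concrete check with the figures from Section 3.2.1 (N = 15, P = 9, k = 7), and counting one IP vector per example per subset and one EP vector per example per differing pose, this amounts to 15 · 9 · 7 = 945 IP samples and 15 · 9 · 8 · 7 = 7560 EP samples.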
Although there will be a much larger number of EP
samples than IP samples in the derived PSFS, we
note that IP samples tend to have low values
compared to EP samples and form a compact cluster
in some sub-space of the PSFS.
This is validated in Figure 3, which shows a 3-D
scatter plot of IP and EP samples from one of the
subsets, obtained by randomly choosing 3 features
from the IP and EP similarity vectors. Note that IP
samples are depicted from all of the 9 poses, while
only those EP samples are depicted that are computed
among large pose variations, such as between the
frontal and left/right profile views or between the
left and right profile views. The scatter plot is shown
on a logarithmic scale for better viewing.
Figure 3 provides an intuitive look at how easily
the problem is separable when there are large
pose variations, while EP samples coming from the
nearest-pose examples can be seen to cause a
marginal overlap with the IP class.
The training set is used as a gallery; thus, for
a test face, computing its similarity with all of the
examples in each pose in each subset of the gallery
produces many representative similarity vectors for
that test image. There is therefore a good chance that
most of the similarity vectors, coming from where
the pose of the test face and of the gallery agree, fall
within the compact IP cluster of the PSFS.
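A sketch of this gallery lookup, again reusing the assumed modified_kl and np from the sketches above, with a hypothetical nested layout in which gallery[t][p] holds the N LESH features of subset t, pose p:

    def probe_similarity_vectors(test_feat, gallery):
        # One similarity vector per (subset, pose) cell of the gallery,
        # i.e. k * P candidate vectors for a single test face.
        return [[np.array([modified_kl(test_feat, f) for f in cell])
                 for cell in subset]    # over the P poses
                for subset in gallery]  # over the k subsets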