Figure 3: Example of the aggregated pre-processing that
can be provided by the proposed focal-plane architecture.
injected into the pixel matrix during the image capture
reset (Fernández-Berni et al., 2011). This reset
energy is first partly consumed by photo-transducing
and then by the dynamics of the charge redistribution,
making the whole operation extremely power-efficient.
Subsequent interconnection patterns can be established
to obtain new averaging maps by joining the regions
just averaged by the previous pattern.
This process can continue as requested by the
algorithm exploiting it, as long as the unavoidable
charge leakage across the chip does not exceed a pre-
scribed limit affecting the precision of the computa-
tions. Too many averaging grids could also limit the
frame rate since, after each step—image capture, first
averaging grid, second averaging grid...—a readout
stage must be performed. This is mandatory due to
the destructive nature of the processing taking place
in every grid with respect to the image representa-
tion provided by the previous one. An illustrative
example is depicted in Fig. 3, where two consecu-
tive rectangular averaging grids are applied over the
original Lena image. Note that only one pixel per
rectangle must be read out, as all the pixels within a
particular region hold the same value. This signifi-
cantly reduces the number of analog-to-digital con-
versions with respect to the original captured image.
Note also that the rectangles rendered by the second
grid come from grouping regions of 2×2 rectangles
from the first grid and then averaging again, thus
destroying the previous representation. For every grid,
all the signals $EN_{C_{i,i+1}}$ and $EN_{R_{j,j+1}}$ are set
to logic ‘1’ except those falling at the boundaries between
rectangles, which must be set to logic ‘0’, thereby confining
charge redistribution to the desired regions.
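The two-grid example can be mimicked in software with a minimal sketch. It assumes ideal averaging (no charge leakage), a hypothetical 4×8 test image, and rectangle sizes chosen only for illustration; the function names `enable_signals`, `average_grid` and `read_out` are not taken from the chip description.

```python
def enable_signals(n, block):
    """Enable between neighbours k and k+1: 0 at rectangle boundaries, 1 elsewhere."""
    return [0 if (k + 1) % block == 0 else 1 for k in range(n - 1)]


def average_grid(image, block_h, block_w):
    """Replace every block_h x block_w rectangle by its mean value.

    Models charge redistribution confined to each rectangle: all pixels
    inside a rectangle end up holding the same averaged value.
    """
    rows, cols = len(image), len(image[0])
    if rows % block_h or cols % block_w:
        raise ValueError("image size must be a multiple of the block size")
    out = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, block_h):
        for c0 in range(0, cols, block_w):
            total = sum(image[r][c]
                        for r in range(r0, r0 + block_h)
                        for c in range(c0, c0 + block_w))
            mean = total / (block_h * block_w)
            for r in range(r0, r0 + block_h):
                for c in range(c0, c0 + block_w):
                    out[r][c] = mean
    return out


def read_out(averaged, block_h, block_w):
    """Read a single pixel per rectangle (its top-left corner here)."""
    return [row[::block_w] for row in averaged[::block_h]]


# Hypothetical 4x8 "captured image" with arbitrary values.
image = [[float(r * 8 + c) for c in range(8)] for r in range(4)]

grid1 = average_grid(image, 1, 2)   # first grid: rectangles of 1x2 pixels
grid2 = average_grid(grid1, 2, 4)   # second grid: 2x2 groups of grid-1 rectangles

samples1 = read_out(grid1, 1, 2)    # 16 values to convert instead of 32
samples2 = read_out(grid2, 2, 4)    # 4 values to convert
column_enables = enable_signals(8, 2)  # column enables for the first grid
```

In this sketch the second call overwrites the representation produced by the first one, which mirrors why a readout stage is needed after each grid.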
3 VIOLA-JONES ALGORITHM
The Viola-Jones sliding window object detector (Viola
and Jones, 2004) is considered a milestone in real-time
generic object recognition. Admittedly, it requires a
cumbersome prior training phase that demands a large
number of cropped samples. Once trained, however, the
detection stage is fast thanks to the computation of
the integral image, an intermediate image representa-
tion speeding up feature extraction, and to a cascade
of classifiers of progressive complexity. Despite its
simplicity and detection effectiveness, the algorithm
still requires computational and memory resources that
are considerable for an embedded system. Different
approaches have been proposed in the literature to increase
performance on limited hardware infrastructure (Camilli
and Kleihorst, 2011; Jia et al., 2012; Ouyang et al.,
2015). In this paper, we describe a new alternative for
the embedded implementation of the algorithm based
on processing acceleration from the very beginning of
the signal chain, the sensing plane itself.
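The early-rejection behaviour of a cascade can be illustrated with a minimal sketch. The two stages below are hypothetical toy tests on a flat list of pixel values, not trained Viola-Jones classifiers; `run_cascade` and the thresholds are assumptions made only for this example.

```python
def run_cascade(window, stages):
    """Evaluate stages in order and stop at the first one that rejects the window.

    Each stage is a (score_function, threshold) pair.
    Returns (accepted, number_of_stages_evaluated).
    """
    for index, (score, threshold) in enumerate(stages, start=1):
        if score(window) < threshold:
            return False, index
    return True, len(stages)


# Hypothetical toy stages: a cheap mean-brightness test first,
# then a contrast test that only runs on surviving windows.
stages = [
    (lambda w: sum(w) / len(w), 50.0),
    (lambda w: max(w) - min(w), 30.0),
]

dark_window = [10, 12, 9, 11]
textured_window = [40, 90, 60, 100]

print(run_cascade(dark_window, stages))      # (False, 1)
print(run_cascade(textured_window, stages))  # (True, 2)
```

Windows rejected by the first stage never reach the later, costlier ones, which is why speeding up that first stage pays off across the whole image.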
As just mentioned, feature extraction from the in-
tegral image is one of the keys for the success of
the Viola-Jones detector. The so-called Haar-like features
simply involve contrast comparison of rectangular
pixel regions across the sliding window. Some
examples are shown in Fig. 4. For each feature, a
weighted sum—or average—of the pixels within the
white rectangles is subtracted from a weighted sum—
or average—of the pixels within the black rectangles.
The integral image, obtained in one pass over the
input image, enables the calculation of these sums
by accessing only four of its accumulated pixels in-
stead of massive processing over the original raw pix-
els. Likewise, contrast normalization for detection in
any lighting conditions demands the computation of
the squared integral image. This normalization precludes
any attempt to skip the computation of the integral image
by directly evaluating the Haar-like features from
averaging grids as proposed in Section 2. Furthermore,
the large number of features to be extracted, e.g., over
2000 for the OpenCV (Bradski, 2000) baseline implementation
of Viola-Jones face detection, would require a large number
of focal-plane grids per captured image, limiting the
achievable frame rate. Instead, we propose to exploit a
reduced number of grids to accelerate the first stage of
the classifier. This stage, the most discriminative of
the cascade, is designed to rapidly reject windows
with very low probability of containing the targeted
object. As explained next, it can be re-defined to make
use of the averaging grids while requiring neither the
integral image nor normalization. Note that a first ad-
vantage of this scheme is that the computation of both
integral images, needed in any case for the rest of the
classifier stages, can be carried out in parallel with the
evaluation of the first stage.
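For reference, the conventional software path described above (integral image, four-access rectangle sums, a Haar-like feature and the variance obtained from the squared integral image) can be sketched as follows. This is a minimal illustration on a hypothetical 4×4 image; the function names, the zero-padded table layout and the choice of which half is "white" or "black" are assumptions, not the OpenCV implementation.

```python
def integral_image(img, power=1):
    """Summed-area table with an extra zero row and column.

    ii[r][c] holds the sum of img[y][x] ** power for all y < r and x < c.
    power=2 gives the squared integral image used for normalization.
    """
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c] ** power
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum over a rectangle using only four accesses to the integral image."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: black (right half) minus white (left half)."""
    half = width // 2
    white = rect_sum(ii, top, left, height, half)
    black = rect_sum(ii, top, left + half, height, half)
    return black - white


def window_std(ii, ii_sq, top, left, height, width):
    """Standard deviation of a window, from the plain and squared integral images."""
    n = height * width
    mean = rect_sum(ii, top, left, height, width) / n
    var = rect_sum(ii_sq, top, left, height, width) / n - mean * mean
    return max(var, 0.0) ** 0.5


# Hypothetical 4x4 image with values 1..16.
img = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
ii = integral_image(img)
ii_sq = integral_image(img, power=2)

inner = rect_sum(ii, 1, 1, 2, 2)              # 6 + 7 + 10 + 11
feature = two_rect_feature(ii, 0, 0, 4, 4)    # right half minus left half
sigma = window_std(ii, ii_sq, 0, 0, 4, 4)     # used to normalize feature values
```

Both tables are built in a single pass over the image, and every rectangle sum afterwards costs four lookups regardless of the rectangle size.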
High-level Performance Evaluation of Object Detection based on Massively Parallel Focal-plane Acceleration Requiring Minimum Pixel
Area Overhead