APPEARANCE-BASED HUMAN GALLERY CONSTRUCTION
FROM VIDEO
Kyongil Yoon
Computer Studies, College of Notre Dame of Maryland, Baltimore, MD, 21210, USA
Yaser Yacoob, David Harwood, Larry Davis
Computer Science Department, University of Maryland, College Park, MD, 20742, USA
Keywords:
People recognition, Gallery construction, Appearance Modeling.
Abstract:
An approach for constructing a dynamic gallery of people observed in a video stream is described. We con-
sider two scenarios that require determining the number and identity of participants: outdoor surveillance and
meeting rooms. In these applications face identification is typically not feasible due to the low resolution
across the face. The proposed approach automatically computes an appearance model based on the clothing
of people and employs this model in constructing and matching the gallery of participants. The appearance
model uses a color/path-length profile and a robust distance measure, based on Kernel Density Estimation (KDE)
and the Kullback-Leibler (KL) distance, to evaluate similarity between people and to decide when to add models to the gallery. A
one-to-one constraint is enforced to correctly match instances to models at each frame. In the meeting room
scenario we exploit the fact that the relative locations of subjects are likely to remain unchanged for the whole
sequence.
1 INTRODUCTION
One aspect of video surveillance of indoor meetings
involves matching a person against a gallery of known
people. Such a gallery is tedious to construct man-
ually; this paper describes an approach to automat-
ically construct a gallery of participants based on
clothing-appearance. The gallery directly supports
the human identification task but it can also be used
to answer questions such as how many people were
observed, when each has appeared and how people
interacted in video sequences.
We propose a method for building a gallery from
a video clip based on clothing-appearance of people.
We assume that people do not change clothing, al-
though our method does tolerate localized appearance
changes. We employ well-known approaches for hu-
man detection in video and focus on the modelling
and matching of human appearance.
We consider two application areas: surveillance
and meeting videos. In both, it is difficult to employ
faces for identification since the resolution across the
face is too small and faces typically appear in off-
frontal poses or profile views. Instead, we model the
clothing of people and acquire quantitative models
that support matching.
2 APPEARANCE MODELS
Over a short period of time, we assume that the appearance of a person remains unchanged, except for
small, local changes, for instance due to carried pack-
ages or illumination variation.
In (Nakajima et al., 2003), a full-body recognition
system based on color and shape features has been
suggested. They carried out recognition using a support vector machine classifier on several features such
as color histogram, normalized color histogram, com-
bined histogram of shape and color, and local shape
features. However, they did not combine spatial in-
formation with color as we do.
(Elgammal et al., 2002) segmented the human fig-
ure into three blobs and computed a separate color
distribution for each blob. Specifically, the head,
torso, and legs were segmented by assuming that the
person appears in an upright pose. Although they sep-
arated the body into parts, most of the spatial information is lost.
Figure 1: A simplified drawing of the human body after Leonardo da Vinci. The red lines show shortest paths. The path-length is the distance from the top of the head to a given point on the path. The path-length to the end of a hand or foot is relatively unchanged by the motion of the arms and legs.
To overcome this problem, we introduced a sim-
ple, efficient feature, path-length, which represents
the spatial information of a pixel with respect to a ref-
erence point on the person’s body. The path-length
is robust to changes in human posture and limb posi-
tions.
The path-length of a pixel is defined as the normalized length of the shortest path, within the silhouette, from the top-center pixel (usually the top of the head) to that pixel.
Fig. 1 illustrates the idea of path-length.
In addition to path-length, clothing color information is employed to model appearance. We use the brightness Br, defined in (Alexander and Buxton, 2001) as the sum of the three color components, and two color proportions, red and green:

\[ \mathit{red} = \frac{RED}{Br}, \qquad \mathit{green} = \frac{GREEN}{Br}. \]
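To make the features concrete, the sketch below computes a normalized path-length map and the color proportions from a binary silhouette mask and an RGB image. It is a minimal Python/NumPy illustration under our own assumptions (the function names, a BFS approximation of the shortest path, and diagonal steps counted as unit length are ours), not the authors' implementation.

import numpy as np
from collections import deque

def path_length_map(mask):
    # Shortest-path distance (in steps) from the top-center silhouette pixel
    # to every pixel inside the silhouette, normalized by the maximum value.
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    top_row = ys.min()
    cols = xs[ys == top_row]
    start = (top_row, int(cols[np.argmin(np.abs(cols - cols.mean()))]))
    dist = np.full((h, w), np.inf)
    dist[start] = 0.0
    q = deque([start])
    while q:
        y, x = q.popleft()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1),
                       (-1, -1), (-1, 1), (1, -1), (1, 1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] \
                    and dist[ny, nx] == np.inf:
                dist[ny, nx] = dist[y, x] + 1.0   # diagonal steps approximated as 1
                q.append((ny, nx))
    reachable = np.isfinite(dist)
    if dist[reachable].max() > 0:
        dist[reachable] /= dist[reachable].max()  # normalize to [0, 1]
    return dist

def color_features(rgb):
    # Brightness Br = RED + GREEN + BLUE; proportions red = RED/Br, green = GREEN/Br.
    rgb = rgb.astype(np.float64)
    br = rgb.sum(axis=2) + 1e-9                   # epsilon avoids division by zero
    return br, rgb[..., 0] / br, rgb[..., 1] / br

Each foreground pixel then contributes a four-dimensional feature (path-length, Br, red, green) to the appearance model.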
3 MATCHING METRIC
The foreground region representing a person is used
to construct an appearance model that is compared
to models in the gallery. The distance between the
current appearance and existing appearance models in
the gallery determines if a new model should be added
to the gallery or not.
3.1 Distance between Models
Our distance measure is computed based on kernel
density estimation (KDE) and Kullback-Leibler (KL)
distance. Kernel density estimation is a general non-
parametric technique to estimate an underlying den-
sity using data points. In KDE, the probability for a given feature x is estimated as

\[ \hat{f}(x) = \sum_{i} \alpha_i K(x - x_i), \]

where K is a kernel function centered at the data points x_i, i = 1,...,n, and the \alpha_i are weighting coefficients. Typically, a Gaussian kernel is used with uniform weights, i.e., \alpha_i = 1/n. Theoretically, suitable kernel density estimators converge to any density function if enough samples are provided (Silverman, 1986), (Duda et al., 2000).
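As a minimal illustration of such an estimator (a generic isotropic Gaussian KDE with uniform weights; the bandwidth sigma is a free parameter here, not a value prescribed by the paper):

import numpy as np

def gaussian_kde(samples, sigma):
    # Returns f_hat(x): a Gaussian kernel density estimate built from
    # 'samples' (an n x d array) with uniform weights 1/n.
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    n, d = samples.shape
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    def f_hat(x):
        x = np.atleast_2d(np.asarray(x, dtype=float))   # m x d query points
        sq = ((x[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2.0 * sigma ** 2)).sum(axis=1) / (n * norm)
    return f_hat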
Assume that we are to compute the distance between a model M = {x_i | i = 1,...,N_p}, where N_p is the number of data points in the appearance model, and the current instance I = {y_i | i = 1,...,N_q}, where N_q is the number of data points in the current instance. The estimated probability distribution of model M is

\[ \hat{f}_M(x) = \sum_{i=1}^{N_p} \frac{1}{N_p} K_{\sigma}(x - x_i) \qquad (1) \]

and the distribution of the current instance I is

\[ \hat{f}_I(x) = \sum_{i=1}^{N_q} \frac{1}{N_q} K_{\sigma}(x - y_i). \qquad (2) \]
The distance between the instance and the model can be thought of as the distance between the two distributions represented by KDE, \hat{f}_M(x) and \hat{f}_I(x). The two most frequently used methods for comparing two distributions are the Chi-Square test and the Kolmogorov-Smirnov test (Press et al., 1988). Neither method is appropriate for our models. The Kolmogorov-Smirnov test is not suitable for our four-dimensional model. The Chi-Square test involves dividing the data points into a number of bins; it is a good approximation when the number of bins is large (≫ 1) and the number of events in each bin is large (≫ 1). However, for human appearance the color distribution is very skewed, leading to many empty bins.
We instead use the Kullback-Leibler (KL) distance to compare \hat{f}_M(x) and \hat{f}_I(x). The KL distance is defined on two probability distributions (Kapur and Kesavan, 1992), (Kullback and Leibler, 1951), (Cover and Thomas, 1991). For any given point, we can compute a pair of probabilities using the two estimated densities. Assume that there is a set of sample points S = {s_i | i = 1,...,n}, where n is the number of sample points. Then likelihood values for the sample points can be computed using

\[ p_i = \hat{p}(s_i) = \frac{1}{N_p} \sum_{j=1}^{N_p} K_{\sigma}(s_i - x_j), \qquad q_i = \hat{q}(s_i) = \frac{1}{N_q} \sum_{j=1}^{N_q} K_{\sigma}(s_i - y_j). \]
To compute the KL distance on those values, we normalize the p_i and q_i as follows:

\[ \hat{p}_i = \frac{p_i}{\sum_{j=1}^{n} p_j}, \qquad \hat{q}_i = \frac{q_i}{\sum_{j=1}^{n} q_j}. \]

The KL distance is then defined as

\[ d_{kl} = d(\hat{q}, \hat{p}) = \sum_{i=1}^{n} \hat{q}_i \log \frac{\hat{q}_i}{\hat{p}_i}. \]
How to select S can be critical. To maximize the difference between the p_i's and q_i's, it is best to use all the points in I; however, the computational cost can be prohibitive. Instead, by sampling points from I, we typically obtain equivalent results as long as the sampling process is reasonable. We sample points uniformly along path-length values. In practice, when we choose 100 points, one at random from each 1% segment of path-length, the results are equivalent to using all the data points.
By examining the KL distance, we can measure how different two distributions are. However, because the p_i and q_i are normalized, this can be problematic. For instance, when \hat{f}_M(x) and \hat{f}_I(x) are uniform distributions over different ranges, all the p_i's are very low and all the q_i's are very high. Although the two distributions are quite different, after normalization the \hat{p}_i and \hat{q}_i form almost identical distributions and d_{kl} is approximately, and misleadingly, 0.
To overcome this limitation, we introduce an additional distance measure which represents the quantitative difference between the p_i's and q_i's:

\[ d_r = \left| 1 - \frac{\bar{p}}{\bar{q}} \right| \qquad (3) \]

where \bar{p} = (\sum_i p_i)/n and \bar{q} = (\sum_i q_i)/n.
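Combining the pieces above, d_kl and d_r can be computed along the following lines. The helper re-evaluates the Gaussian KDE likelihoods directly; the bandwidth, the number of sample points, and the plain random sampling of S from the instance (the paper samples one point per 1% path-length segment) are simplifying assumptions of ours.

import numpy as np

def kde_likelihoods(samples, points, sigma):
    # p_i = (1/N) * sum_j K_sigma(s_i - x_j) for each sample point s_i.
    d = samples.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    sq = ((points[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2)).mean(axis=1) / norm

def kl_and_ratio_distance(model, instance, sigma=0.05, n_samples=100, seed=0):
    # model, instance: (N x 4) arrays of [path-length, Br, red, green] features.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(instance), size=min(n_samples, len(instance)), replace=False)
    s = instance[idx]                               # sample points S drawn from I
    p = kde_likelihoods(model, s, sigma)            # likelihoods under the model M
    q = kde_likelihoods(instance, s, sigma)         # likelihoods under the instance I
    p_hat = p / (p.sum() + 1e-12)                   # normalized for the KL distance
    q_hat = q / (q.sum() + 1e-12)
    d_kl = float(np.sum(q_hat * np.log((q_hat + 1e-12) / (p_hat + 1e-12))))
    d_r = abs(1.0 - p.mean() / (q.mean() + 1e-12))  # un-normalized quantitative difference
    return d_kl, d_r, p, q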
3.2 Robust Distance Measure
Human appearance in video streams varies over time.
In outdoor scenes, lighting, human pose variation and
carried objects may lead to changes in the foreground
region. To cope with such variations we employ a
robust estimation norm that adjusts the weighting of
points within the distance metric based on whether
points are inliers or outliers.
For the robust estimation, we employ the general
M-estimator of (Huber, 1977), which minimizes the
objective function,
\[ \sum_{i=1}^{n} \rho(e_i) = \sum_{i=1}^{n} \rho(y_i - x_i^{T} b) \qquad (4) \]

where the x_i are independent variables, the y_i are data points, b is a coefficient vector, \rho is the objective function, and n is the number of data points.
If we define the weight function \omega(e) = \rho'(e)/e and let \omega_i = \omega(e_i), then minimizing (4) requires solving

\[ \sum_{i=1}^{n} \omega_i (y_i - x_i^{T} b)\, x_i^{T} = 0. \qquad (5) \]
In our approach, we define a new feature \delta_i using p_i and q_i for each sample point s_i:

\[ \delta_i = \frac{|q_i - p_i|}{\max(p_i, q_i)}. \]
When the current instance is correctly matched to a model, most p_i's are similar to the q_i's, so the \delta_i's are close to 0. On the other hand, when the instance and model are mismatched, most \delta_i's will be substantially greater than 0. The mean of the \delta_i's therefore indicates how well the current instance matches the model. We apply the robust fitting (5) to compute the robust mean \mu of the \delta_i's; it can be written as

\[ \sum_{i=1}^{n} \omega_i (\delta_i - \mu) = 0. \]
Notice that weights are designed to minimize the in-
fluence of outliers. In other words, the weight of each
data point depends on how far the point is from the
mean. Data points near to the estimated mean get high
weight. Points that are far from the mean have smaller
weights.
We used the iteratively re-weighted least squares (IRLS) method with the bisquare weight function to solve this equation for the robust mean, as in (Coleman et al., 1980) and (Fox, 2002).
After the estimated mean converges, the final weights from the last iteration are examined to identify inliers. Only data points with a weight greater than a threshold are regarded as inliers. The two distances, d_r and d_{kl}, are then recomputed using only the inliers. Fig. 2 shows examples of outliers and inliers determined by the robust fitting method for a sample region that has been manually altered by changing its color.
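As an illustration, the robust mean of the delta_i and the corresponding inlier set can be obtained with a short IRLS loop using the bisquare weight function; the tuning constant c, the MAD-based scale estimate, and the inlier threshold below are conventional choices we assume here, not values stated in the paper.

import numpy as np

def robust_mean_bisquare(delta, c=4.685, iters=50, tol=1e-6):
    # IRLS estimate of the mean of delta_i with Tukey's bisquare weights.
    # Returns the robust mean and the final weights (near-zero weight = outlier).
    delta = np.asarray(delta, dtype=float)
    mu = np.median(delta)
    w = np.ones_like(delta)
    for _ in range(iters):
        r = delta - mu
        scale = np.median(np.abs(r)) / 0.6745 + 1e-12      # robust scale (MAD)
        u = r / (c * scale)
        w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
        mu_new = np.sum(w * delta) / (np.sum(w) + 1e-12)
        if abs(mu_new - mu) < tol:
            mu = mu_new
            break
        mu = mu_new
    return mu, w

# Example use with the likelihoods of Section 3.1 (the threshold value is assumed):
# delta = np.abs(q - p) / np.maximum(p, q)
# mu, w = robust_mean_bisquare(delta)
# inliers = w > 0.1          # d_kl and d_r are then recomputed on the inliers only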
4 SPATIAL ANALYSIS
Sometimes it is possible to improve the accuracy of
the models in the gallery and the matching perfor-
mance by utilizing the relative order of participants.
We perform this as follows.
For each model M_i, we compute an adjacency matrix F_i that captures the frequency of spatial ordering among models. An adjacency matrix F_i is m x n, where n is the number of models and m indexes relative positions. For example, if N is the maximum number of people in one frame and people are arranged in a "linear" configuration, then m = 2(N - 1).
Figure 2: Detection of outliers. The image in the first column is the model image. The images in the second column are used as instances. To synthesize outliers, a block covering 15% of the region is filled with red pixels. In the third column the inliers and outliers are shown as white and black points, respectively.
To build the adjacency matrix F_i, all the frames which have a person matched to model M_i are employed. The (j,k)-th element of F_i is the frequency of model M_k at the relative horizontal position pos(j), where pos(.) is defined as

\[ pos(j) = \begin{cases} j - \frac{m}{2} - 1 & \text{if } j < \frac{m}{2} \\ j - \frac{m}{2} & \text{otherwise.} \end{cases} \qquad (6) \]
The upper half of an adjacency matrix F_i represents the frequencies of models to the "left" of M_i; the bottom half, the "right" side.
The difference between adjacency matrices indicates how similar two models are to each other. To compute the distance between adjacency matrices, the sum of absolute differences is used. Before computing d_{ij}, each F_i is normalized by \max_{j,k} (F_i)_{j,k}, so we have

\[ d_{ij} = \sum_{k=1}^{m} \sum_{l=1}^{n} \left| (F_i)_{k,l} - (F_j)_{k,l} \right|. \qquad (7) \]
Fig. 3 shows the adjacency matrices from the experiment described in detail in Section 5.2; 15 models were found after the first pass. Distances between adjacency matrices are computed, and pairs whose distance is less than a threshold can be merged into one model.
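Assuming the per-frame matching results are available as lists of (model index, horizontal position) pairs, the spatial analysis can be sketched as follows. The row encoding of the relative position is one reasonable realization of pos(j), and the merge threshold is left to the caller; neither is taken verbatim from the paper.

import numpy as np

def build_adjacency_matrices(frames, n_models, n_max):
    # frames: list of per-frame lists of (model_id, x_position) pairs.
    # Returns one (m x n_models) matrix per model; rows encode the relative
    # horizontal position of the other models (left half = left, right half = right).
    m = 2 * (n_max - 1)
    F = [np.zeros((m, n_models)) for _ in range(n_models)]
    for frame in frames:
        ordered = [mid for mid, _ in sorted(frame, key=lambda t: t[1])]
        for a, mi in enumerate(ordered):
            for b, mk in enumerate(ordered):
                rel = b - a                    # signed relative position of mk w.r.t. mi
                if rel == 0:
                    continue
                j = rel + m // 2 if rel < 0 else rel + m // 2 - 1
                F[mi][j, mk] += 1
    return F

def adjacency_distance(Fi, Fj):
    # Sum of absolute differences between max-normalized matrices (Eq. 7).
    a = Fi / (Fi.max() + 1e-12)
    b = Fj / (Fj.max() + 1e-12)
    return float(np.abs(a - b).sum())

Model pairs whose adjacency distance falls below a chosen threshold are candidates for merging.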
5 EXPERIMENTS
We present two experiments. The first was conducted
on four video clips collected at different locations and
under different illumination conditions. The second
experiment analyzes an 18 minute long video clip of
a meeting. In this experiment, a face detection algorithm was used to determine an approximate torso area.
Figure 3: Adjacency matrices for the 15 models in the experiment of Section 5.2.
Figure 4: Sample frames of the full body gallery test.
In each experiment, we show the final gallery
and the matching results based on the gallery.
The gallery construction process consists of two passes (a sketch of the first pass is given below).
1. Construct an initial gallery. From an empty set,
a gallery is built while processing all the frames.
After this pass, the gallery has all the tentative
models.
2. Refine the gallery. In this pass, redundant models
are removed based on frequency and spatial anal-
ysis of the matching result, and a more compact
and accurate gallery is built.
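The first pass can be sketched as a simple online loop. How d_kl and d_r are combined into a single score, the acceptance threshold, and the use of the Hungarian algorithm (scipy.optimize.linear_sum_assignment) to enforce the one-to-one constraint within a frame are our assumptions about one reasonable realization, not details fixed by the paper.

import numpy as np
from scipy.optimize import linear_sum_assignment

def first_pass(frames, distance, threshold):
    # frames: list of per-frame lists of instance feature arrays.
    # distance(instance, model) -> scalar score (e.g., combining d_kl and d_r on inliers).
    # Returns the tentative gallery and, per frame, the model index matched to each instance.
    gallery, matches = [], []
    for instances in frames:
        frame_match = [None] * len(instances)
        if gallery and instances:
            cost = np.array([[distance(inst, mdl) for mdl in gallery]
                             for inst in instances])
            rows, cols = linear_sum_assignment(cost)      # one-to-one assignment
            for r, c in zip(rows, cols):
                if cost[r, c] < threshold:
                    frame_match[r] = c
        for i, inst in enumerate(instances):
            if frame_match[i] is None:                    # no sufficiently close model
                gallery.append(inst)                      # add a new model to the gallery
                frame_match[i] = len(gallery) - 1
        matches.append(frame_match)
    return gallery, matches

The second pass then runs over the same frames with the gallery held fixed, and redundant models are removed using the match frequencies and, where applicable, the spatial analysis of Section 4.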
5.1 Full Body Gallery - Experiment 1
For this experiment, 1212 frames were collected from
four different video clips. Three clips were outdoor
video, and one clip was captured in a room monitor-
ing people coming and going. The number of people
in the test set is 12. We employed a background sub-
traction algorithm to detect the foreground regions.
The detected regions are treated as the full-body appearance of a person. Fig. 4 shows some images in this
test set.
After the first pass, we have 24 models in the gallery. The second pass uses this static gallery of 24 models. In this experiment, most redundancy comes from inaccuracies in the human silhouettes produced by background subtraction.
Table 1: Matching results - full body.

Gallery    Num. of Models    Correct Match    Incorrect Match    Match Rate
Initial    24                1609             291                85.1%
Refined    16                1583             307                83.7%
Figure 5: The final gallery built with a test set of Fig. 4. The
number of models in the gallery is 16.
After the second pass, we have a final gallery of 16 models, as shown in Fig. 5. All 12 people have models. Two people have two models each ((M_3, M_9) and (M_14, M_15)), and one person has three models (M_1, M_7, M_8).
In this data set, 1890 foreground areas are detected
from the 1212 frames. Using the final gallery with
16 models, we could match 1583 regions correctly,
while 307 are mismatched (83.7% success). When we
use the 24 model gallery before removing redundant
models, the number of correct matches is 1609 and
291 regions are not matched correctly (85.1% suc-
cess). The representation power of the gallery depends on the data set and the foreground segmentation results. When using the same segmentation results, the
final gallery has similar representation power com-
pared to the gallery before redundant model removal
(Table 1).
5.2 Upper Body Gallery - Experiment 2
An 18-minute-long video clip containing 8 people is used for this experiment. Although the total number of frames is 32400, only one frame out of every five was processed.
Figure 6: Some frames showing matching results with the
final gallery.
Figure 7: Sample frames from the video clip used in upper
body gallery test.
Table 2: Frequency of each model.

Model    M_1    M_2    M_3    M_4    M_5
Freq.    910    703    370    277    1945

Model    M_6    M_7    M_8    M_9    M_10
Freq.    426    1892   2997   9      3359

Model    M_11   M_12   M_13   M_14   M_15
Freq.    97     221    16     18     7
This video clip was captured in a meeting room, and people remain seated without changing position. The cameras pan and tilt as the
meeting progresses, so that at any one time we see a
different subset of the participants. Only the upper
bodies of people are seen.
We employ a face detection algorithm to locate
people (Viola and Jones, 2001). Based on the detected
faces, the torso areas were computed and appearance
matching was conducted. Since the relative positions
between people remain unchanged for the entire clip,
we perform the spatial analysis described in section 4.
Several frames are shown in Fig. 7. The first pass constructed a gallery of 15 models, excluding false alarms from the face detector.
In the second pass, the spatial analysis of relative horizontal positions was carried out. The adjacency matrices of the 15 models are shown in Fig. 3.
Before calculating the differences between adja-
cency matrices, the total frequency for each model
is used to eliminate some models. The total number
of face occurrences is 13709, and some models have
very low frequency. Table 2 shows the frequency of
each model.
As seen in Table 2, M_9, M_13, M_14, and M_15 can be eliminated since their frequencies are very low. Next, by thresholding the differences between adjacency matrices, we select pairs of models that can be merged into one.
The final gallery has 8 models. Although nine people appear in the video clip, the ninth person is seen only in side view and was not detected by the face detection algorithm. The eighth person was not included in the gallery, and two models were found for the first person. Table 3 shows the gallery we constructed; merged models are shown in parentheses. Fig. 9 shows some of the matching results using the final gallery.
Table 3: Final gallery.

Person    Model(s)
P_1       M_3, M_4
P_2       M_2
P_3       (M_1, M_12)
P_4       (M_5, M_9, M_11, M_13, M_15)
P_5       (M_6, M_7)
P_6       M_10
P_7       (M_8, M_14)
P_8       NONE
Figure 8: 8 models in the final gallery after the spatial anal-
ysis.
To investigate the identification accuracy of the matching, we randomly chose 100 frames, which were found to contain 210 face areas. Table 4 summarizes the results. As in the experiment of Sec. 5.1, even with the smaller number of models the gallery shows similar performance.
6 CONCLUSION AND FUTURE
WORK
We proposed an approach for constructing a dynamic gallery of people from a video clip or a set of frames, based on an appearance model that uses a color/path-length profile. The Kullback-Leibler distance is used to robustly compare models, and a one-to-one constraint is enforced when more than one instance is present and matched in a frame. When the order of people rarely changes, the relative spatial order is analyzed and used to remove redundant models from the gallery.
There is a trade-off between the representation power and the compactness of the gallery. Using multiple key-frames to build a model can give the models more representation power. One direction for future work is to find an effective method for accumulating information from multiple frames into a single model.
Table 4: Matching results - upper body.

Gallery    Num. of Models    Correct Match    Incorrect Match    Match Rate
Initial    15                198              12                 94.3%
Refined    8                 194              16                 92.4%
Figure 9: Some frames showing matching results with the
final gallery in the second experiment.
REFERENCES
Alexander, D. C. and Buxton, B. F. (2001). Statistical mod-
eling of colour data. International Journal of Com-
puter Vision, 44(2):87–109.
Coleman, D., Holland, P., Kaden, N., Klema, V., and Peters,
S. C. (1980). A system of subroutines for iteratively
reweighted least squares computations. ACM Trans.
Math. Softw., 6(3):327–336.
Cover, T. M. and Thomas, J. A. (1991). Elements of Infor-
mation Theory. New York: Wiley.
Duda, R. O., Stork, D., and Hart, P. E. (2000). Pattern Clas-
sification. John Wiley and Sons Inc.
Elgammal, A., Duraiswami, R., Harwood, D., and Davis,
L. S. (2002). Background and foreground model-
ing using non-parametric kernel density estimation
for visual surveillance. Proceedings of the IEEE,
90(7):1151–1163.
Fox, J. (2002). Robust regression: Appendix to An R and S-PLUS Companion to Applied Regression.
Huber, P. J. (1977). Robust Statistical Procedures. Society
for Industrial and Applied Mathematics.
Kapur, J. N. and Kesavan, H. K. (1992). Entropy Optimiza-
tion Principles with Applications. Academic Press.
Kullback, S. and Leibler, R. A. (1951). On information and
sufficiency. Annals of Mathematical Statistics, 22:79–
86.
Nakajima, C., Pontil, M., Heisele, B., and Poggio, T.
(2003). Full-body person recognition system. Pattern
Recognition, 36(9):1997–2006.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flan-
nery, B. P. (1988). Numerical Recipes in C: The Art of
Scientific Computing. Cambridge University Press.
Silverman, B. W. (1986). Density estimation for statistics
and data analysis. Chapman & Hall, New York.
Viola, P. A. and Jones, M. J. (2001). Rapid object detection
using a boosted cascade of simple features. In CVPR
(1), pages 511–518.