vation is to make use of existing cameras and avoid
expensive camera network setup and maintenance.
3 OUTLINE OF OBJECTIVES
Based on the above discussion, the objectives of my
PhD thesis are as follows:
i. Compute unique people count over a certain inter-
val of time from monocular videos.
ii. Make use of existing cameras by avoiding expen-
sive camera setup and maintenance.
iii. Overcome occlusion problems and still obtain
high people count accuracy.
iv. Apply the algorithm on different scenarios and
various kinds of human figures.
4 RESEARCH PROBLEM
My PhD thesis aims to develop a robust algorithm
whose input is a monocular video containing views of
people and whose output is the total unique count of
people within a certain duration of the video. The
algorithm is intended for application to real life
problems. To avoid an expensive and challenging
camera network, it works on the view taken from a
single camera. Finally, apart from dealing with sparse
crowds, the algorithm should deal with large as well
as dense crowds. Hence, it must be capable of
handling occlusions.
5 STATE OF THE ART
The computer vision based algorithms for people
counting from monocular videos are mainly used for
finding two types of counts: frame based people
count and unique people count. Frame based count is
also known as density estimation.
The frame based people counting algorithms
count people in individual video frames with rea-
sonable accuracy even in the presence of occlu-
sions (Chan et al., 2008; Chan and Vasconcelos, 2012;
Chan and Vasconcelos, 2009; Conte et al., 2010;
Tan et al., 2011; Lempitsky and Zisserman, 2010).
These methods use extracted features from individual
frames and count the number of people in each frame
with the help of machine learning techniques that map
the extracted features to the number of people present
in the frame. But these methods fail to count the
unique number of people present in a video over an
interval of time, as they do not consider the corre-
spondence of the same person over multiple frames.
For example, if there are n people in the first frame
and one person enters, while another person exits the
FOV in the second frame, the frame based counting
will produce n as the people count for the second
frame. However, the unique count of people for the
two frames should be n + 1.
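The difference between the two counts can be illustrated with a small sketch. It assumes, purely for illustration, that every frame comes with a set of person identifiers, which a frame based counter does not actually have; the function names `frame_counts` and `unique_count` are hypothetical and not taken from any of the cited methods.

```python
def frame_counts(frames):
    """Per-frame people count: the size of each frame's identity set."""
    return [len(ids) for ids in frames]


def unique_count(frames):
    """Unique people count: size of the union of identities over all frames."""
    seen = set()
    for ids in frames:
        seen |= set(ids)
    return len(seen)


# Hypothetical example with n = 3: person 1 exits and person 4 enters.
frames = [{1, 2, 3}, {2, 3, 4}]
print(frame_counts(frames))  # [3, 3]
print(unique_count(frames))  # 4
```

Both frames report n = 3 people, while the union over the two frames contains n + 1 = 4 identities. Recovering those identities from the video is exactly the correspondence problem that frame based methods leave unsolved.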
The computer vision based solutions to unique
people count can be further categorized into three
types: a) the detection and tracking based approach
(Harasse et al., 2005; Kim et al., 2002; Zeng and Ma,
2010), b) the visual feature clustering based approach
(Brostow and Cipolla, 2006; Rabaud and Belongie,
2006) and c) the line of interest (LOI) counting
approach (Ma and Chan, 2013; Cong et al., 2009;
Kim et al., 2008). The first two approaches analyse
individuals and are somewhat successful for low density
crowds or overhead camera views, but they do not
scale to large crowds. In such views, there is too
much occlusion, people are depicted by only a few
pixels, or the scene is otherwise too challenging for
tracking. The LOI counting methods are capable of
handling occlusion, but they have received
relatively little attention so far.
The detection and tracking based approaches (Ha-
rasse et al., 2005; Kim et al., 2002; Zeng and Ma,
2010) count people by detecting individuals in an im-
age and creating corresponding trajectories by track-
ing them. The number of trajectories in an interval of
time accounts for the number of people. This tech-
nique works well for situations where the object size
is large, the crowd is not too dense and occlusion is
not severe. A large object size helps detection, as
there are enough image pixels to depict the object.
Tracking is reliable for overhead FOVs, where little or
no occlusion is present. In the case of whole body
views, where partial occlusion is present, particle
filter based tracking can be applied. Applying
the detection-tracking approach becomes difficult in
dense crowds where each person is depicted by only
a few image pixels and people occlude each other in
complex ways. Detection becomes challenging due to
both occlusion and the small sizes of people. Occlu-
sion also poses a difficult challenge for tracking.
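The counting step of this approach can be sketched as follows. The sketch assumes that a detector and a tracker have already produced one trajectory per person, summarised here by hypothetical first and last frame indices; the `Trajectory` type and the `count_people` function are illustrative names, not part of the cited systems.

```python
from typing import NamedTuple


class Trajectory(NamedTuple):
    track_id: int
    first_frame: int  # first frame in which the person was tracked
    last_frame: int   # last frame in which the person was tracked


def count_people(trajectories, start, end):
    """Count trajectories that overlap the frame interval [start, end]."""
    return sum(
        1 for t in trajectories
        if t.first_frame <= end and t.last_frame >= start
    )


# Hypothetical tracker output: three people, one of them outside the interval.
tracks = [
    Trajectory(0, 0, 40),
    Trajectory(1, 10, 90),
    Trajectory(2, 120, 200),
]
print(count_people(tracks, 0, 100))  # 2
```

In this sketch, a track that occlusion breaks into two fragments would be counted as two people, which is one reason why the count degrades in dense crowds.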
The visual feature trajectory clustering meth-
ods (Brostow and Cipolla, 2006; Rabaud and Be-
longie, 2006) cluster feature trajectories that exhibit
coherent motion and the number of clusters is used as
the number of moving objects. This type of method
requires sophisticated trajectory management, like
handling broken feature tracks due to occlusions or
measuring similarities between trajectories of differ-
ent length. Thus, in crowded environments, it is fre-
VISIGRAPP2014-DoctoralConsortium