context. We present a recognition framework that
jointly captures the individual activity and its ac-
tivity relationships with its neighbors.
The rest of the paper is organized as follows. We
review related work on activity analysis in crowded
scenes in Section 2. Section 3 describes the human
activity descriptor in group context, along with the
conditional random field model used to address the
activity recognition task. Experimental results and
evaluations are presented in Section 4. Finally,
Section 5 concludes the paper.
2 RELATED WORK
In this section, we review related work on human
activity analysis in crowded scenes that uses a top-down
or bottom-up approach. In bottom-up approaches,
group context is used to differentiate ambiguous
activities, e.g., standing and talking, which are normally
represented by the same local descriptors. Most
approaches integrate contextual information by
proposing a new feature descriptor extracted from an
individual and its surrounding area. Lan et al. (Lan
et al., 2012b) propose an Action Context (AC)
descriptor capturing the activity of the focal person and
the behavior of other people nearby. The AC descriptor
is computed by concatenating the focal person’s action
probability vector (computed using a Bag-of-Words
approach with an SVM classifier) and the context action
probability vectors capturing the activities of
neighboring people. However, the AC descriptor
can only capture spatial proximity information through
its ‘nearby’ context. Considering a more sophisticated
contextual descriptor, Choi et al. (Choi et al.,
2009) propose the Spatio-Temporal Volume (STV)
descriptor, which captures the spatial distribution of the
pose and motion of individuals in a scene to analyze group
activity. The STV descriptor, centered on a person of
interest (the anchor), is used to classify the
group activity. The descriptor is a histogram of
people and their poses in different spatial bins around the
anchor. These histograms are concatenated over the
video to capture the temporal nature of the activities.
An SVM with pyramid kernels is used for classification.
The same descriptor is leveraged in (Choi et al., 2011),
but Random Forest classification is used for group
activity analysis. In addition, the random forest structure
is used to randomly sample spatio-temporal regions
and pick the most discriminative features. Recently,
Amer and Todorovic (Amer and Todorovic, 2011) introduced
Bags-of-Right-Detections (BORD), which seeks to remove
noisy people detections in groups. BORD is a
histogram of human poses detected in a spatio-temporal
neighborhood centered at a point in the video volume.
BORD is not computed from all people in the
neighborhood, but only from those detections that are
considered to take part in the target activity. A two-tier MAP
inference algorithm is proposed for the final
recognition step.
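To make the idea of a context descriptor concrete, the following is a minimal sketch in the spirit of the AC descriptor described above: the focal person's action probability vector is concatenated with a context vector summarizing nearby people. The function name, the fixed radius, and the element-wise maximum used to aggregate neighbors are illustrative assumptions, not details taken from (Lan et al., 2012b).

```python
from math import hypot


def action_context_descriptor(focal_probs, focal_pos, neighbors, radius):
    """Concatenate focal action probabilities with a context vector.

    neighbors: list of ((x, y), action_probs) for the other detected people.
    Assumption (illustrative only): the context vector holds, per action
    class, the maximum probability among neighbors within `radius` of the
    focal person, and zeros when no neighbor is close enough.
    """
    context = [0.0] * len(focal_probs)
    for (x, y), probs in neighbors:
        if hypot(x - focal_pos[0], y - focal_pos[1]) <= radius:
            context = [max(c, p) for c, p in zip(context, probs)]
    return list(focal_probs) + context


# Hypothetical example with three action classes.
focal = [0.7, 0.2, 0.1]
people = [
    ((1.0, 0.0), [0.1, 0.8, 0.1]),    # nearby
    ((0.0, 2.0), [0.3, 0.3, 0.4]),    # nearby
    ((10.0, 10.0), [0.0, 0.0, 1.0]),  # too far away, ignored
]
descriptor = action_context_descriptor(focal, (0.0, 0.0), people, radius=3.0)
```

In this sketch the far-away person does not influence the descriptor, which mirrors the limitation noted above: only spatial proximity is encoded.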
In contrast to bottom-up approaches, top-down
methods model the entire group as a whole rather than
each individual separately. Khan and Shah (Khan and
Shah, 2005) use a rigidity formulation to represent
parade activities. They model the group shape as a 3D
polygon with each corner representing a participating
person. The tracks of the people in the group are treated
as tracks of feature points on a 3D polygon. Using
the rank of the track matrix, activities are classified as
parades or just random crowds. Vaswani et al. (Vaswani
et al., 2003) model an activity using a polygon and
its deformation over time. Each person in the group
is treated as a point on the polygon. The model is
applied to abnormality detection in a crowded scene.
In (Chang et al., 2010), multi-camera multi-target
tracks are used to generate a dissimilarity measure
between people, which in turn is used to cluster them
into groups. Group activities are recognized by treating the
group as an entity and analyzing the behavior of the
group over time. Mehran et al. (Mehran et al., 2009)
built a ‘Bag-of-Forces’ model of the movements of
people in a video frame using the social force model to
detect abnormal crowd behavior. Close to the top-down
approach, Ryoo and Aggarwal (Ryoo and Aggarwal, 2011)
present an approach that splits group activity into
sub-events such as individual activities and
person-to-person interactions. Each portion is represented
using a context-free grammar and the probability of its
occurrence given a group activity or time period. A hierarchical
recognition algorithm based on a Markov Chain Monte
Carlo density sampling technique is developed. The
technique identifies the groups and the group activity
simultaneously.
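The rank-based rigidity test mentioned above can be illustrated with a small sketch. It stacks the x and y coordinates of every track into a 2F x P matrix (F frames, P people), subtracts each row's mean, and checks whether the numerical rank stays low. The threshold of 3 is borrowed from the standard rank bound for rigid motion under orthographic projection; the function names, the threshold, and the exact-rank test are assumptions for illustration and are not claimed to be the criterion of (Khan and Shah, 2005). On noisy tracks a singular-value test would be needed instead.

```python
def numerical_rank(matrix, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting."""
    rows = [[float(v) for v in r] for r in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


def track_matrix(tracks):
    """Build the mean-centered 2F x P matrix from per-person (x, y) tracks."""
    num_frames = len(tracks[0])
    rows = []
    for f in range(num_frames):
        for axis in (0, 1):
            row = [track[f][axis] for track in tracks]
            mean = sum(row) / len(row)
            rows.append([v - mean for v in row])
    return rows


def looks_rigid(tracks, max_rank=3):
    """Assumed decision rule: low rank suggests a rigid formation."""
    return numerical_rank(track_matrix(tracks)) <= max_rank


# Hypothetical data: five people keeping a fixed formation while translating.
offsets = [(0, 0), (5, 0), (10, 0), (0, 5), (10, 5)]
parade = [[(ox + f, oy) for f in range(4)] for ox, oy in offsets]

# Hypothetical data: five people moving independently of one another.
crowd = [[(1.0 if p == f else 0.0, 0.0) for f in range(4)] for p in range(5)]

parade_is_rigid = looks_rigid(parade)
crowd_is_rigid = looks_rigid(crowd)
```

With at least five people the centered matrix can exceed rank 3, so the test can separate the two toy cases; with four or fewer people it would be uninformative.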
Recently, several approaches that leverage social
signaling cues for analyzing crowded scenes have
been proposed. Group activities can be better inferred
from social interaction cues between the people
present in the scene. Several approaches identify
meaningful groups in videos using the spatial and
orientational arrangement of people in the scene as a
cue, based on social signaling principles (Farenzena
et al., 2009b; Farenzena et al., 2009a; Tran et al., 2014).
Lan et al. (Lan et al., 2012a) present a bottom-up
approach integrating social role analysis to understand
activities in crowded scenes. Unlike the above
approaches, our approach takes
advantage of both bottom-up and top-down mecha-
nisms by designing a group context activity descrip-
VISAPP 2015 - International Conference on Computer Vision Theory and Applications