A Group Contextual Model for Activity Recognition in Crowded Scenes

Khai N. Tran, Xu Yan, Ioannis A. Kakadiaris and Shishir K. Shah

University of Houston, Department of Computer Science, 4800 Calhoun Rd, TX 77204, U.S.A.

Keywords:

Group Context Activity, Activity Recognition, Social Interaction.

Abstract:

This paper presents an efﬁcient framework for activity recognition based on analyzing group context in

crowded scenes. We use graph based clustering algorithm to discover interacting groups using top-down

mechanism. Using discovered interacting groups, we propose a new group context activity descriptor captur-

ing not only the focal person’s activity but also behaviors of its neighbors. For a high-level of understanding

of human activities, we propose a random ﬁeld model to encode activity relationships between people in the

scene. We evaluate our approach on two public benchmark datasets. The results of both the steps show that our

method achieves recognition rates comparable to state-of-the-art methods for activity recognition in crowded

scenes.

1 INTRODUCTION

Typically, in crowded scenes, people are engaged

in multiple activities resulting from inter and intra-

group interactions. This poses a rather challenging

problem in activity recognition due to variations in

the number of people involved, and more speciﬁcally

the different human actions and social interactions

exhibited within people and groups (Ryoo and

Aggarwal, 2011; Tran, 2013). Understanding groups

and their activities is not limited to only analyzing

movements of individuals in group. The environment

in which these groups exist provides important

contextual information that can be invaluable in rec-

ognizing activities in crowded scenes. Perspectives

from sociology, psychology and computer vision

suggest that human activities can be understood

by investigating a subject in the context of social

signaling constraints (Smith et al., 2008; Helbing and

Moln

ar, 1995; Cristani et al., 2011). Exploring the

spatial and directional relationships between people

can facilitate the detection of social interactions in

a group. Thus, activity analysis in crowded scenes

can often be considered a multi-step process, one

that involves individual person activity, individuals

forming meaningful groups, interaction between

individuals and interactions between groups. In

general, the approaches to group activity analysis

can be classiﬁed into two categories: bottom-up

and top-down. The bottom-up approaches rely on

recognizing activity of each individual in a group.

Vice versa, top-down approaches recognize group

activity by analyzing at the group level rather than

at the individual level. Bottom-up approaches show

the understanding of activities at the individual level,

however they are limited in recognizing activities

at group level. Top-down approaches show better

contextual understanding of activities in groups but

they are not robust enough to recognize activities at

the individual level.

In this paper, we develop a social context frame-

work for recognizing human activities in crowded

scenes by taking advantage of both top-down and

bottom-up approaches. Our hybrid framework local-

izes groups through social interaction analysis using

a top-down approach and analyzes individual activity

based on social context within the group using a

bottom-up mechanism. We propose a novel group

context activity descriptor capturing characteristics of

individual activity with respect to the behavior of its

neighbors along with an efﬁcient conditional random

ﬁeld model to learn and classify human activities in

crowded scenes.

The main contributions of our work are:

1. A Group Context Activity Descriptor. We use

a top-down approach to dynamically localize in-

teracting groups capturing behaviors of individu-

als. We form a group context activity descriptor

that is a combination of individual activity and its

neighbor’s behavior, represented using the Bag-

of-Words (BoW) representation.

2. An Efﬁcient Conditional Random Field Frame-

work to Learn and Classify Human Activities in

Tran K., Yan X., Kakadiaris I. and Shah S..

A Group Contextual Model for Activity Recognition in Crowded Scenes.

DOI: 10.5220/0005258600050012

In Proceedings of the 10th International Conference on Computer Vision Theory and Applications (VISAPP-2015), pages 5-12

ISBN: 978-989-758-090-1

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

context. We present a recognition framework that

jointly captures the individual activity and its ac-

tivity relationships with its neighbors.

The rest of the paper is organized as follows. We

review related work on activity analysis in crowded

scenes in section 2. Section 3 describes the human ac-

tivity descriptor in group context along with the con-

ditional random ﬁeld model used to address the activ-

ity recognition task. Experimental results and evalu-

ations are presented in Section 4. Finally, Section 5

concludes the paper.

2 RELATED WORK

In this section, we review related work on human ac-

tivity analysis in crowded scenes that use a top-down

or bottom-up approach. In bottom-up approaches,

group context is used to differentiate ambiguous ac-

tivities e.g. standing and talking, which are normally

represented by the same local descriptors. Most ap-

proaches integrate contextual information by propos-

ing a new feature descriptor extracted from an in-

dividual and its surrounding area. Lan et al. (Lan

et al., 2012b) propose an Action Context (AC) de-

scriptor capturing the activity of the focal person and

the behavior of other people nearby. AC descriptor is

computed by concatenating the focal person’s action

probability vector (computed using Bag-of-Words ap-

proach with SVM classiﬁer), and the context action

probability vectors capturing the activities of other

neighborhood people. However, this AC descriptor

only can capture spatial proximity information by us-

ing ‘near by’ context. Considering a more sophisti-

cated contextual descriptor, Choi et al. (Choi et al.,

2009) propose Spatio-Temporal Volume (STV) de-

scriptor, which captures spatial distribution of pose

and motion of individuals in a scene to analyze group

activity. STV descriptor centered on a person of in-

terest or an anchor is used for classiﬁcation of the

group activity. The descriptor is a histogram of peo-

ple and their poses in different spatial bins around the

anchor. These histograms are concatenated over the

video to capture the temporal nature of the activities.

SVM using pyramid kernels is used for classiﬁcation.

The same descriptor is leveraged in (Choi et al., 2011)

but Random Forest classiﬁcation is used for group ac-

tivity analysis. In addition, random forest structure

is used to randomly sample the spatio-temporal re-

gions to pick most discriminative features. Recently,

Amer et al. (Amer and Todorovic, 2011) introduced

Bags-of-Right-Detections (BORD) seeking to remove

noisy people detection in groups. BORD is a his-

togram of human poses detected in a spatio-temporal

neighborhood centered at a point in the video volume.

The BORD is not computed from all neighborhood

people, but only from those detections that are consid-

ered to take part in the target activity. A two-tier MAP

inference algorithm is proposed for the ﬁnal recogni-

tion step.

In contrast to bottom-up approaches, top-down

methods model the entire group as a whole rather than

each individual separately. Khan and Shah (Khan and

Shah, 2005) use rigidity formulation to represent pa-

rade activities. They modeled group shape as a 3D

polygon with each corner representing a participating

person. The tracks from person in group are treated

as tracks of feature points in a 3D polygon. Using

rank of track matrix, activities are classiﬁed as pa-

rade or just random crowds. Vaswani et al. (Vaswani

et al., 2003) model an activity using a polygon and

its deformation over time. Each person in the group

is treated as a point on the polygon. The model is

applied to abnormality detection in a crowded scene.

Multi-camera multi-target tracks are used to generate

dissimilarity measure between people, which in turn

are used to cluster them into groups in (Chang et al.,

2010). Group activities are recognized by treating the

group as an entity and analyzing the behavior of the

group over time. Mehran et al. (Mehran et al., 2009)

built a ‘Bag-of-Forces’ model of the movements of

people using social force model in a video frame to

detect abnormal crowd behavior. Close to top-down

approach, Ryoo et al. (Ryoo and Aggarwal, 2011)

present an approach that splits group activity into sub-

events like person activity and person to person in-

teractions. Each portion is represented using context

free grammar and the probability of their occurrence

given a group activity or time periods. A hierarchical

recognition algorithm based on Markov Chain Monte

Carlo density sampling technique is developed. The

technique identiﬁes the groups and group activity si-

multaneously.

Recently, several approaches that leverage social

signaling cues for analyzing crowded scenes have

been proposed. Group activities can be better inferred

from valuable social interactions cues between peo-

ple present in the scene. Several approaches are pro-

posed to identify meaningful group from the videos

using spatial and orientational arrangement of peo-

ple in the scene as a cue based on social signaling

principles (Farenzena et al., 2009b; Farenzena et al.,

2009a; Tran et al., 2014). Lan et al. (Lan et al., 2012a)

present a bottom-up approach integrating social role

analysis to understand activities in crowd scene. Dif-

ferent from above approaches, our approach takes

advantage of both bottom-up and top-down mecha-

nisms by designing a group context activity descrip-

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

Figure 1: Illustration of group discovery. Human interactions in a group is represented as an undirected edge-weighted graph.

Dominant set based clustering algorithm is used to localize interacting groups. There are four discovered groups from the

scene: {5,6,7,8}, {1,2}, {3} and {4}.

tor capturing individual activity and behavior of its

neighbors within its groups. Once meaningful groups

are identiﬁed from the videos by using top-down ap-

proach, group context activity descriptor is built for

each individual in discovered group. Using this de-

scriptor a random ﬁeld model is built to recognize in-

dividual activities using a bottom-up approach.

3 APPROACH

In this paper, we mainly focus on recognizing human

activities in crowded scenes. Thus, we assume that

people in a crowded scene have been detected and the

trajectories of people in 3D space and the head poses

are available or methods such as (Choi et al., 2009;

Hoiem et al., 2006) can be used to obtain the same.

3.1 Group Discovery in Crowded Scene

In general, the analysis of complex activity in

crowded scene is a challenging task, due to noisy ob-

servations and unobserved communication between

people. In order to understand which people in the

scene form meaningful groups, we employ a top-

down approach proposed in (Tran et al., 2014) to dis-

cover socially interacting groups in the scene. This

top-down approach basically represents all detected

people as a graph where each vertex represents one

person and weighted edges describe the social in-

teraction between any two people in a group. The

dominant set based clustering algorithm is used to

discover the interacting groups (Pavan and Pelillo,

2007). Fig.1 depicts the overview of group discovery

in the crowded scene.

3.2 Model Formulation

Given a set N = {1, ..., n} of all the people detected

in the scene, let x = {x

, x

, ..., x

} be the set of peo-

ple activity descriptors; a = {a

, a

, ..., a

} be the set

of individual activity labels, where x

is feature vec-

tor and a

∈ A is activity label associated with person

i ∈ N (A is set of all possible activity labels). As a re-

sult of clustering people in the scene to different inter-

acting groups, let us deﬁne G = {G

, G

, ..., G

} as

the set of discovered groups where G

is set of people

clustered in group c and ∪

c=1

= N. We introduce

a standard conditional random ﬁeld model to learn

the strength of the interactions between activities in

discovered groups. The activity interaction is condi-

tioned on image evidence, so that the model not only

takes into account which activity each person is en-

gaged in, but also higher-order dependencies between

activities. Our model is represented as:

Ψ(a, x) =

∑

i∈N

φ(a

, x

) +

∑

c=1

∑

(i, j)∈G

φ(a

, a

) (1)

where φ(a

, x

) is a singleton factor that models the

probability of person i’s activity label a

∈ A given

its feature vector x

. φ(a

, a

) is the pairwise factor

that models the probability between pair of individ-

ual activities a

and a

, where (i, j) belong to the same

group G

discovered by using top-down approach de-

scribed in section 3.1. A graphical illustration of our

model discovering meaningful groups and formula-

AGroupContextualModelforActivityRecognitioninCrowdedScenes

Figure 2: Illustration of our conditional random ﬁeld model

for each discovered interacting group. The activity-to-

activity relationships in each group are represented by

dashed lines.

tion of conditional random ﬁeld model is shown in

Fig.2.

The model described in Eq.1 only captures high-

level activity-to-activity relationships of people in dis-

covered groups. This limit us from analyzing the de-

tailed interaction of individual activity and the behav-

ior of its neighbors within a group. Thus we introduce

a low-level group context activity descriptor that en-

codes detailed individual activity interactions within

a group.

3.3 Group Context Activity Descriptor

An ideal context activity descriptor can efﬁciently

incorporate the focal person’s activity in spatio-

temporal relationship with activities in its spatial

proximity. Lan et al. (Lan et al., 2012b) propose an

Action Context (AC) descriptor capturing the activity

of the focal person and the behavior of other people

nearby. AC descriptor uses spatial proximity as an

indicator of context. They do not consider whether

the people near-by are engaged in meaningful inter-

actions or not, effectively leading to a semantically

noisy descriptor. Moreover, we argue that AC is rep-

resented by concatenating a set of probability vectors

computed using Bag-of-Words approach with SVM

classiﬁer that adds to ambiguity already existent in

the representation (BoW) for each person. Choi et

al. (Choi et al., 2009) employs well-known shape con-

text idea (Belongie et al., 2002) to propose Spatio-

Temporal Volume (STV) descriptor, which captures

spatial distribution of pose and motion of individuals

in a scene. The descriptor centered on a person of

interest or an anchor is represented as histograms of

people and their poses in different spatial bins around

the anchor. STV descriptor can effectively capture

higher-level spatial relationship of individual interac-

tions. Nonetheless, it is too coarse to capture ﬁner se-

mantically driven contextual relationship of individ-

ual activities in detail.

We develop a novel group context activity (GCA)

descriptor that exploits strategies from above ap-

proaches. Our descriptor is centered on a person (the

focal person), and describes the behaviors of focal

person and its semantic neighbors represented by ar-

ranging individual activity descriptors in polar view.

Let f = { f

, f

, ..., f

} be the set of local activity de-

scriptors formed using Bag-of-Words representation

for all people in the scene, where f

is K-dimensional

vector representing person i’s activity (K is number of

visual codewords). Dense trajectory based descriptors

have shown to be efﬁcient for representing actions in

video, thus we employ approach proposed in (Wang

et al., 2011) to extract motion boundary histogram

(MBH) as local activity descriptors. Given the i-th

person in discovered group G

as the focal person, we

divide its context region into P sub-polar context re-

gions characterized by number of orientation bins and

radial bins (Belongie et al., 2002). Using spatial rela-

tionship between people in discovered group G

, we

extract descriptors in each sub-polar context region

around the focal person. As a result, the group con-

text activity descriptor x

for person i is represented

as a (P + 1) × K dimensional vector including focal

activity descriptor computed as follows:

= [ f

∑

j∈S

(i)

∑

j∈S

(i)

, ...,

∑

j∈S

(i)

] (2)

where S

(i) is set of people in the p-th sub-polar con-

text region of person i. Fig.3 shows the extraction of

group context activity descriptor for a selected person

in a discovered group.

3.4 Inference and Learning

Our model is a standard Conditional Random Field

(CRF) with no hidden variables. We train a multi-

class SVM classiﬁer based on GCA descriptors and

their associated labels to learn and compute singleton

factor φ(a

, x

). Given an observation x

, we use SVM

parameters to compute probabilities for all possible

activity labels. From training data, we use top-down

approach to discover interacting groups in the scene.

All pairs of activity labels in discovered groups are

counted to compute pairwise factor φ(a

, a

Given a new testing scene, our inference task is

to ﬁnd best activity label assignments for all people

detected in the scene. The prediction assignment a

∗

computed by running MAP inference on the network

as:

∗

= argmax

Ψ(a, x) (3)

where Ψ(a, x) is speciﬁed in Eq.1.

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

Figure 3: Depiction of Group Context Activity (GCA) descriptor extraction. From left to right, people are localized in different

groups using a top-down approach from (Tran et al., 2014); local descriptors are extracted from dense trajectories (Wang et al.,

2011); local BoW is computed for each person’s activity; GCA descriptor is extracted for a selected person in a discovered

group by computing descriptor for each sub-polar context bin.

4 EXPERIMENTS AND RESULTS

In this section, we describe the experiments designed

to evaluate the performance of the proposed group

context activity (GCA) descriptor and framework for

human activity recognition in crowded scenes.

4.1 Datasets

In this work, we choose to use two challenging bench-

mark datasets to evalute our proposed approach in

recognizing human activities in a crowded scene.

The ﬁrst benchmark dataset is Collective Activity

dataset (Choi et al., 2009). The old version of

dataset contains 5 activities in group (Crossing, Wait-

ing, Queuing, Walking and Talking) and recently, the

authors presented a new version of dataset includ-

ing two additional activities (Dancing and Jogging).

HOG based human detection and head pose estima-

tion along with a probabilistic model is used to esti-

mate camera parameters (Choi et al., 2009). Extended

Kalman ﬁltering is employed to extract 3D trajecto-

ries of people in the scene. These automatically ex-

tracted 3D trajectories and head pose estimates are

provided as a part of the dataset. Thus, the dataset

represents real world, noisy observations with occlu-

sions and automatic person detection and trajectory

generation.

The second benchmark dataset is UCLA Court-

yard dataset recently introduced by Amer et. al (Amer

et al., 2012). This dataset contains 106 minutes of

high resolution videos at 30 fps from a bird-eye view

of a courtyard at the UCLA campus. The annotations

in term of bounding boxes, poses, and activity labels

are provided for each frames in video. The dataset

contains 10 primitive human activities which are Rid-

ing Skateboard, Riding Bike, Riding Scooter, Driving

Car, Walking, Talking, Waiting, Reading, Eating, and

Sitting.

4.2 Model Parameters

In using the group discovery algorithm (Tran et al.,

2014), we set parameters that maintain the ratio pro-

posed in (Was et al., 2006) and the social distance

function is modeled as the power function F

(r) =

(1 − r)

, n > 1. We deﬁne 56 activity labels (8

head poses × 7 activity labels) for new version of

Collective Activity dataset and 40 activity labels (8

head poses × 5 activity labels) for UCLA Court-

yard dataset by combining the head poses and ac-

tivity labels. We train a multi-class SVM classiﬁer

which is used to compute singleton factors by utiliz-

ing the libSVM library (Chang and Lin, 2011) with

linear kernel on GCA descriptor. Using discovered

groups from top-down approach, respectively, matri-

ces of size 56 × 56 and 40 × 40 are used to learn and

look up pairwise factors for Collective Activity and

UCLA Courtyard datasets. For recognition, we use

libDAI (Mooij, 2010) to perform inference in our con-

ditional random ﬁeld model.

To compute the MBH descriptors, we set the

neighborhood size N = 32 pixels, the spatial cell

= 2, the temporal cells n

= 3, trajectory length

L = 10, and dense sampling step size W = 5 for dense

tracking. This setting claims to empirically give good

results for a wide range of datasets (see (Wang et al.,

2011) for parameter details). In designing GCA de-

scriptor, we select codebook size of K = 200 by clus-

tering a subset of 100, 000 randomly selected training

features using k-means. In addition, we evaluate our

proposed model in different settings of P, which is

AGroupContextualModelforActivityRecognitioninCrowdedScenes

Table 1: Recognition rates of various proposed methods on Collective Activity dataset (Choi et al., 2009).

Accuracy (%)

Approach Year Walk Cross Queue Wait Talk Jog Dance Avg

Choi (Choi et al., 2009) 2009 57.9 55.4 63.3 64.6 83.6 N/A N/A 65.9

Choi (Choi et al., 2011) 2011 N/A 76.5 78.5 78.5 84.1 94.1 80.5 82.0

Amer (Amer and Todorovic, 2011) 2011 72.2 69.9 96.8 74.1 99.8 87.6 70.2 81.5

Amer (Amer et al., 2012) 2012 74.7 77.2 95.4 78.3 98.4 89.4 72.3 83.6

Lan (Lan et al., 2012b) 2012 68.0 65.0 96.0 68.0 99.0 N/A N/A 79.1

Our Method 60.4 60.6 89.1 80.9 93.1 93.4 95.4 82.9

Table 2: Recognition rates of proposed methods on UCLA Courtyard dataset (Amer et al., 2012).

Accuracy (%)

Approach Walk Wait Talk Drive Car Ride S-board Ride Scooter Ride Bike Read Eat Sit Avg

Amer (Amer et al., 2012) 69.1 67.7 69.6 70.2 71.3 68.4 61.4 67.3 71.3 64.2 68.1

Our Method 74.3 69.9 70.0 N/A N/A N/A N/A 72.8 N/A 70.8 71.4

number of sub-polar context regions around a focal

person. Basically P = R × O where R is number of

radial bins and O is number of orientation bins. How-

ever, given a focal person within his discovered group,

context activity descriptor differs from others by dis-

criminating in orientation distribution rather than ra-

dial distribution. Thus in our case, R is set to 1 and our

GCA descriptor is controlled by P = O number of ori-

entation bins. For the special case when P = 0, GCA

descriptor amounts to the focal person local activity

descriptor without using context (x

= f

). This is the

same for a non-group person in the scene who does

not belong to any discovered groups. Experiments

show that P = 4 and P = 16 achieves the best perfor-

mances in Collective Activity and UCLA Courtyard

datasets, respectively.

4.3 Human Activity Recognition

Evaluation

We summarize the recognition results obtained using

our method and other approaches in Table 1 for Col-

lective Activity dataset using standard 4-folds cross-

validation scheme. As we can see, our proposed ap-

proach achieves recognition rates comparable to state-

of-the-art methods in the new version of Collective

Activity dataset. Fig. 4(Top) shows the confusion ma-

trices obtained on Collective Activity dataset. It lists

the recognition accuracy for each activity individu-

ally. The low values of the non-diagonal elements

imply that the descriptor is highly discriminative with

very low decision ambiguity between different activi-

ties. The confusion matrix also shows the most confu-

sion between Walking and Crossing activities in Col-

lective Activity dataset, which can be explained be-

cause both are essentially Walking activity but with

Figure 4: Confusion matrices for Collective Activity dataset

(Top) and UCLA Courtyard dataset (Bottom).

different scene semantic. Since most of Waiting ac-

tivities are in social interaction with Walking activities

in both datasets, so there is a relatively high confusion

between Waiting and Walking. Overall, the confusion

matrix shows very high accuracy rates in recognizing

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

Figure 5: Depiction of computed dense tracks in UCLA Courtyard dataset. Due to low resolution, there are few tracked

trajectories extracted for people in shadow regions. Thus, there are very few local activity descriptors extracted for people in

those regions.

Queue, Talk, Dance and Jog activities. This can be

explained because our group context activity descrip-

tor efﬁciently encodes activities in different contexts.

For UCLA Courtyard dataset, Table 2 shows our

recognition rate in comparison with other proposed

methods, and Fig. 4(Bottom) shows the recognition

confusion matrix. As we can see, our proposed ap-

proach achieves recognition rates that outperform the

state-of-the-art methods in recognizing selected activ-

ities in UCLA Courtyard dataset. However, there is a

limitation in using our framework to UCLA Court-

yard dataset. The dense track algorithm proposed

in (Wang et al., 2011) does not perform well across all

observations in the UCLA Courtyard videos. There

are very small number of dense trajectories extracted

from people in shadow regions in comparison to other

regions (see Fig. 5). Thus, there are not enough ex-

tracted descriptors to build GCA descriptor for those

people in shadow regions. Using alternate feature de-

tectors could alleviate this problem and hence the lim-

itation towards computing the local activity descrip-

tor for UCLA Courtyard videos. Due to this limita-

tion, not all activities are included in our evaluation.

Some activities such as Riding Skateboard, Riding

Bike, Riding Scooter, Driving Car, and Eating are

limited and hence do not provide sufﬁcient exemplars

for learning.

5 CONCLUSION

In this paper, we have proposed an efﬁcient frame-

work for recognizing human activities in crowded

scenes. We have introduced a novel group context

activity descriptor efﬁciently capturing focal person’s

activity and its neighbor’s behavior. Along with group

context activity descriptor, we also proposed a high-

level recognition framework jointly captures the in-

dividual activity and its activity relationships with its

neighbors. We evaluated our approach in two public

benchmark datasets. The results demonstrate that our

approach obtains results comparable to state-of-the-

art in recognizing human activities in crowded scenes.

REFERENCES

Amer, M. R. and Todorovic, S. (2011). A chains model for

localizing participants of group activities in videos. In

Proc. IEEE International Conference on Computer Vi-

sion.

Amer, M. R., Xie, D., Zhao, M., Todorovic, S., and Zhu, S.-

C. (2012). Cost-sensitive top-down/bottom-up infer-

ence for multiscale activity recognition. In Proc. Eu-

ropean Conference on Computer Vision, pages 187–

200.

Belongie, S., Malik, J., and Puzicha, J. (2002). Shape

matching and object recognition using shape contexts.

IEEE Transactions on Pattern Analysis and Machine

Intelligence, 24(4):509–522.

Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for

support vector machines. ACM Transactions on Intel-

ligent Systems and Technology, pages 27:1–27:27.

Chang, M.-C., Krahnstoever, N., Lim, S., and Yu, T. (2010).

Group level activity recognition in crowded environ-

ments across multiple cameras. In Proc. IEEE Inter-

national Conference on Advanced Video and Signal

Based Surveillance, pages 56–63, DC, USA.

Choi, W., Shahid, K., and Savarese, S. (2009). What are

they doing? : collective activity classiﬁcation using

spatio-temporal relationship among people. In Proc.

Visual Surveillance Workshop, ICCV, pages 1282 –

1289.

Choi, W., Shahid, K., and Savarese, S. (2011). Learning

context for collective activity recognition. In Proc.

Computer Vision and Pattern Recognition, pages 3273

–3280, Spring CO, USA.

Cristani, M., Bazzani, L., Paggetti, G., Fossati, A., Tosato,

D., Bue, A. D., Menegaz, G., and Murino, V. (2011).

AGroupContextualModelforActivityRecognitioninCrowdedScenes

Social interaction discovery by statistical analysis of

f-formations. In Proc. British Machine Vision Confer-

ence, pages 23.1–23.12.

Farenzena, M., Bazzani, L., Murino, V., and Cristani, M.

(2009a). Towards a subject-centered analysis for au-

tomated video surveillance. In Proc. International

Conference on Image Analysis and Processing, pages

481–489, Berlin, Heidelberg.

Farenzena, M., Tavano, A., Bazzani, L., Tosato, D., Pagetti,

G., Menegaz, G., Murino, V., and Cristani, M.

(2009b). Social interaction by visual focus of atten-

tion in a three-dimensional environment. In Proc.

Workshop on Pattern Recognition and Artiﬁcial Intel-

ligence for Human Behavior Analysis at AI*IA.

Helbing, D. and Moln

ar, P. (1995). Social force model for

pedestrian dynamics. Physical Review E, 51(5):4282–

4286.

Hoiem, D., Efros, A., and Hebert, M. (2006). Putting ob-

jects in perspective. In Proc. Computer Vision and

Pattern Recognition, volume 2, pages 2137–2144.

Khan, S. M. and Shah, M. (2005). Detecting group activi-

ties using rigidity of formation. In Proc. ACM Inter-

national Conference on Multimedia, MULTIMEDIA

’05, pages 403–406, New York, NY, USA. ACM.

Lan, T., Sigal, L., and Mori, G. (2012a). Social roles in

hierarchical models for human activity recognition.

In Proc. Computer Vision and Pattern Recognition,

pages 1354 –1361.

Lan, T., Wang, Y., Yang, W., Robinovitch, S., and Mori, G.

(2012b). Discriminative latent models for recogniz-

ing contextual group activities. IEEE Transactions on

Pattern Analysis and Machine Intelligence.

Mehran, R., Oyama, A., and Shah, M. (2009). Abnormal

crowd behavior detection using social force model.

In Proc. Computer Vision and Pattern Recognition,

pages 935 –942.

Mooij, J. M. (2010). libDAI: A free and open source C++

library for discrete approximate inference in graphi-

cal models. Journal of Machine Learning Research,

11:2169–2173.

Pavan, M. and Pelillo, M. (2007). Dominant sets and pair-

wise clustering. IEEE Transactions on Pattern Analy-

sis and Machine Intelligence, pages 167 –172.

Ryoo, M. and Aggarwal, J. (2011). Stochastic represen-

tation and recognition of high-level group activities.

International Journal of Computer Vision, pages 183–

200.

Smith, K., Ba, S., Odobez, J.-M., and Gatica-Perez, D.

(2008). Tracking the visual focus of attention for a

varying number of wandering people. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence,

30:1212 –1229.

Tran, K. (2013). Contextual Descriptors for Human Activity

Recognition. PhD thesis, University of Houston.

Tran, K., Gala, A., Kakadiaris, I., and Shah, S. (2014). Ac-

tivity analysis in crowded environments using social

cues for group discovery and human interaction mod-

eling. Pattern Recognition Letters, 44(0):49 – 57. Pat-

tern Recognition and Crowd Analysis.

Vaswani, N., Roy Chowdhury, A., and Chellappa, R.

(2003). Activity recognition using the dynamics of

the conﬁguration of interacting objects. In Proc. Com-

puter Vision and Pattern Recognition, volume 2, pages

II – 633–40 vol.2.

Wang, H., Klaser, A., Schmid, C., and Liu, C.-L. (2011).

Action recognition by dense trajectories. In Proc.

Computer Vision and Pattern Recognition, pages 3169

–3176.

Was, J., Gudowski, B., and Matuszyk, P. J. (2006). Social

distances model of pedestrian dynamics. In Cellular

Automata for Research and Industry, pages 492–501.

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications