Novel Anomalous Event Detection based on Human-object Interactions

Rensso Mora Colque¹, Carlos Caetano¹, Victor C. de Melo¹, Guillermo Camara Chavez² and William Robson Schwartz¹

¹Universidade Federal de Minas Gerais, DCC, Belo Horizonte, Brazil
²Universidade Federal de Ouro Preto, ICEB, Ouro Preto, Brazil
Keywords: Anomalous Event Detection, Human-object Interaction, Contextual Information.
Abstract: This study proposes a novel approach to anomalous event detection that collects information from a specific context and is flexible enough to work in different scenes (i.e., the camera does not need to be at the same location or in the same scene for the learning and test stages of anomalous event detection), making our approach able to learn normal patterns (i.e., patterns that do not entail an anomaly) from one scene and be employed in another as long as it is within the same context. For instance, our approach can learn the normal behavior for a context such as the office environment by watching a particular office, and then it can monitor the behavior in another office, without being constrained to aspects such as camera location, optical flow or trajectories, as required by current works. Our paradigm-shifting approach to anomalous event detection exploits human-object interactions to learn normal behavior patterns from a specific context. Such patterns are used afterwards to detect anomalous events in a different scene. The proof of concept shown in the experimental results demonstrates the viability of two strategies that exploit this novel paradigm to perform anomaly detection.
1 INTRODUCTION
Anomaly detection for video surveillance has gained importance in academia and industry. Researchers have therefore focused on extracting characteristics that help to determine anomalous events without departing from context. This already challenging task becomes harder as semantic information about the anomaly is added.
Many studies are based on a typical pipeline em-
ploying representations based on spatiotemporal fea-
tures (low level characteristics extracted from tempo-
ral regions) (Hasan et al., 2016; Wang and Xu, 2016;
Zhou et al., 2016; Cheng et al., 2015; Colque et al.,
2015; Leyva et al., 2017) followed by one-class clas-
sification to determine whether an event is anomalous.
These approaches model anomalies using characteristics such as velocity (magnitude, orientation), appearance, density and location. However, this type of information is not well suited to the anomalous event detection problem since it is constrained to the same camera view, preventing detection from being performed on different scenes within the same context. For instance, it prevents one from learning normal patterns in one particular office and detecting anomalous events in another office. On the other hand, our proposed approach is based on human-object interactions for a specific context, which is more flexible and allows performing anomaly detection on different scenes within the same context.
Common representations for anomaly detection
are based on the following low-level characteristics:
texture (appearance), optical flow information (mag-
nitude and orientation) and agent location. Such features fit well only in approaches focusing on
anomaly detection for specific views, i.e., the information extracted from a particular scene cannot be used to detect anomalies in other scenes (e.g., a different camera view) because it is camera-dependent.
Therefore, the investigation of approaches focused on
higher-level semantic information is desired for per-
forming a more flexible anomaly detection.
In this paper, we address the problem of anomaly
detection with a different perspective. The main idea
is to learn information regarding one context (e.g., office or classroom environments) from a specific scene (e.g., a particular office or classroom) and use that information to detect anomalies in other scenes belonging
to the same context (e.g., a different office or class-
room). In this way, our method is more flexible than
the current state-of-the-art approaches found in the lit-
erature (Fang et al., 2016; Wang and Xu, 2016; Li
et al., 2016; Feng et al., 2016), which are tied to a
single camera view. To be able to perform anomaly
detection in multiple scenes within a given context,
our proposed approach considers higher level seman-
tic information based on human-object interactions to
learn normal patterns.
Different from the current approaches that learn
normal patterns based on spatiotemporal information,
we learn such patterns using human-object interac-
tions. Based on such interactions, we consider two
strategies for anomaly detection: i) unrecognized in-
teractions; and ii) incorrect sequence of interactions.
While the former focuses on finding interactions that
did not occur during the learning stage, the latter ver-
ifies whether the interactions occur in the same se-
quence as in the learning stage, otherwise they are
considered as anomalies.
The main contribution of this work is a new model for anomaly representation based on human-object interactions. Our model differs significantly from the
state-of-the-art approaches in how it collects the scene
information. While the current models use a specific
view, our model learns patterns from a scene and is
able to detect anomalies in a distinct scene.
2 RELATED WORKS
A common category of the anomaly detection meth-
ods found in the literature addresses the problem by
learning activity patterns from low level handcrafted
visual features. In (Yu, 2014), a new feature named Mixture of Kernel Dynamic Texture (MKDT) was proposed based on (Doretto et al., 2003), a statistical model that transforms the video sequence to represent the appearance and dynamics of the video. Inspired
by the classic HOF feature descriptor, a spatiotem-
poral feature based on both orientation and velocity
was proposed in (Mora-Colque et al., 2017), which
captures information from cuboids (regions with spa-
tial and temporal support). In (Bera et al., 2016), the
authors restricted their work to trajectory-level behav-
iors or movement features per agents, including cur-
rent position, average velocity (including speed and
direction), cluster flow, and the intermediate goal po-
sition. However, as mentioned in (Sabokrou et al.,
2017), hand-crafted features cannot represent video
events very well. Thus, our model does not use low
level features as main characteristics.
Online anomaly detection is another category
found in the literature. In (Javan Roshtkhari and Levine, 2013), the authors proposed an online unsupervised method based on spatiotemporal video volume construction, using both local and global compositional information with dense sampling at various spatial and temporal scales. According to (Xu et al., 2013), the aforementioned methods model activity patterns considering only local or global context, leading to a lack of global or local information about abnormal motion patterns. In view of that, a hierarchical framework that considers both global and local spatiotemporal contexts was proposed in (Xu et al., 2013). In (Cheng et al., 2015), a unified framework was also proposed to detect both local and global anomalies using a sparse set of STIPs. The majority of these models are based on low level features, consequently sharing the same advantages and drawbacks.
Recently, a growing trend is the employment of
deep neural networks (DNNs). The authors in (Xu
et al., 2015) proposed a novel Appearance and Mo-
tion Deep-Net (AMDN) framework for discovering
anomalous activities. Instead of using handcrafted
features, they learned discriminative feature represen-
tations of both appearance and motion patterns in a
fully unsupervised manner. Still image patches and
dynamic motion fields represented with optical flow
are used as input of two separate networks, to learn
appearance and motion features, respectively. Afterwards,
early fusion is performed by combining pixels with
their corresponding optical flow to learn a joint repre-
sentation. Finally, a late fusion strategy is introduced
to combine the anomaly scores predicted by multiple
one-class SVM classifiers. Inspired by the same architecture, an incremental model based on deep representations of texture and motion was proposed in (Xu et al., 2017). A spatiotemporal CNN model was developed by (Zhou et al., 2016), which uses appearance and motion information extracted from consecutive frames to capture anomalous events appearing even in a small part of the frame.
To take advantage of the best of the two worlds
(handcrafted and deep features), Hasan et al. (Hasan
et al., 2016) used an autoencoder, based on the
two types of features, to learn regularity in video
sequences. According to the authors, the autoen-
coder can model the complex distribution of the reg-
ular dynamics of appearance changes. The authors
in (Ribeiro et al., 2017) proposed a model that col-
lects appearance and motion information, used to feed
a Convolutional Auto-Encoder (CAE). The model is trained on normal situations and uses a regularized reconstruction error (RRE) to recognize normal and abnormal events. These models focus on a specific camera view; therefore, the information used for recognizing anomalies is highly correlated to the view. On the other hand, our model aims to handle different views, as long as the context remains the same.
The aforementioned methods perform single camera view analysis, i.e., they use only one camera view to train and test. In contrast, the authors of (Loy, 2010) proposed a model based on pairwise analysis for disjoint cameras. In our study, we use contextual information to determine possible anomalous events, considering not only disjoint cameras but also different camera views and distinct environments.
3 PROPOSED APPROACH
Before presenting the proposed approach, two important aspects must be discussed: the definition used for anomalous patterns and the types of anomalies our approach intends to detect. In this work, we use the following premise: "if something has not happened before, it will be considered as an anomalous pattern". Hence, any pattern that differs significantly from the patterns observed during the training stage should be labeled as anomalous. It is important to emphasize that this premise is very common and is used by other anomaly detection methods in the literature. Regarding the types of anomaly, our model is oriented to detect anomalies in different environments, implying that typical information employed for anomaly detection (e.g., velocity and orientation captured through optical flow) cannot be used due to the change of camera position. To overcome this problem, our approach describes the scene as a set of interactions between persons and objects.
Our approach is divided into two main steps: anomaly representation and anomalous pattern detection. In the first step, a preprocessing stage allows our model to describe the sequence by creating structures that represent interactions. Basically, the idea is to collect these interactions as patterns belonging to a certain person. In the second step, the extracted patterns are compared to detect whether some of them may be considered anomalous. In the first phase, all subjects contribute by creating a set of patterns with their performed interactions. In the second phase, the patterns found in training are used to detect whether someone is performing an anomaly. Thus, we can detect and locate which person in a specific frame is performing a strange or unknown activity. Here, it is important to highlight that the video sequences belonging to the training set are totally different from the testing sequences; however, all of them share the same context.
3.1 Anomaly Representation
To describe the scene, our model is based on a human-object interaction representation. Therefore, we must define the scene actors and how they interact with each other. First, our model detects people and objects present in the scene using the model proposed in (Liu et al., 2015). Then, to determine which objects are interacting with a person, we employ a human pose estimation approach (Eichner et al., 2012) to locate the person's hands and to link the objects with people. Finally, we estimate people's trajectory paths (tracklets) using a tracking algorithm based on a Kalman filter. (Since the main goal of this work is to detect anomalies exploiting a graph-based approach, we do not focus deeply on the preprocessing steps, i.e., detection and tracking.)
Before describing the sequence as a set of structures, our model defines the interactions and tracklets. The first step consists of linking each person with objects. This is a challenging task, especially when depth information is not present. The proposed model uses a straightforward heuristic to link humans and objects by employing the position of the hands to determine the distance between objects and the person. After each subject in the scene is linked with the objects it interacts with, the next step consists of tracking the person. Note that the first step alone can only determine unrecognized objects; in the second step, the tracking allows our model to create a temporal representation that aims at finding unrecognized interaction sequences.
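As a minimal sketch of this linking heuristic (not our exact implementation), assuming bounding boxes given as (x_min, y_min, x_max, y_max) tuples and hand coordinates from the pose estimator; the helper names and the default pixel threshold (the $\delta_2 = 100$ value reported in Section 4) are illustrative choices:

```python
import math

def box_center(box):
    """Center (x, y) of a bounding box (x_min, y_min, x_max, y_max)."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def link_objects(hands, object_boxes, delta_2=100.0):
    """Link a person to every object whose bounding-box center lies within
    delta_2 pixels of one of the person's estimated hand positions."""
    linked = set()
    for obj_id, box in object_boxes.items():
        cx, cy = box_center(box)
        if any(math.hypot(cx - hx, cy - hy) <= delta_2 for hx, hy in hands):
            linked.add(obj_id)
    return linked

# e.g., hand positions from the pose estimator and detected object boxes
hands = [(120.0, 240.0), (180.0, 236.0)]
objects = {"C1": (100, 200, 160, 260), "C2": (400, 200, 460, 260)}
print(link_objects(hands, objects))  # {'C1'}
```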
Similar to a graph, we employ a structure that represents the object interactions of a certain person in a particular frame by connecting them with edges, which we call a kernel. Here, each interaction may be represented by a label denoting the set of objects that this specific interaction contains. Thus, we can see the tracklet representation as a list of kernels formed by square and circle nodes. Square nodes represent the person belonging to the tracklet, and circle nodes represent the objects that have been linked with such person. Figure 1 illustrates the representation. Note that objects may appear many times; this is handled by a unique interaction label composed of the sorted object identifiers.
A set of labels that represent the kernels is collected. These labels are used to build a "dictionary", where each word represents a specific interaction between a person and one or more objects. For instance, an interaction with the object C introduces the word "C" to the dictionary. In Figure 1, we can list the words $C_1$, $C_1C_2$, and $C_1C_2C_3$. This dictionary saves the knowledge regarding normal interactions between humans and objects that have been observed during the training stage.
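A minimal sketch of how the kernel labels and the dictionary could be built, assuming object identifiers are strings such as "C1" and that each tracklet is a list of per-frame sets of linked object ids; the `kernel_label` and `build_dictionary` helpers are hypothetical names:

```python
def kernel_label(object_ids):
    """Build a unique interaction label from the sorted object
    identifiers, e.g. {'C2', 'C1'} -> 'C1C2'."""
    return "".join(sorted(object_ids))

def build_dictionary(training_tracklets):
    """Collect the set of interaction 'words' observed during training.

    training_tracklets -- iterable of tracklets, each a list of
                          per-frame sets of linked object ids.
    """
    dictionary = set()
    for tracklet in training_tracklets:
        for object_ids in tracklet:
            if object_ids:  # skip frames with no interaction
                dictionary.add(kernel_label(object_ids))
    return dictionary

# e.g., the tracklet of Figure 1 yields the words C1, C1C2 and C1C2C3
words = build_dictionary([[{"C1", "C2"}, {"C1"}, {"C1", "C2", "C3"}, {"C1"}]])
assert words == {"C1", "C1C2", "C1C2C3"}
```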
Figure 1: Interaction representation from a person tracklet. Squares represent the person at a specific frame and circles the linked objects, where different colors indicate different objects. For instance, in frame i-1, the blue square represents the person and the circles in blue and gray represent two different objects. The interaction with these objects creates the kernel called $C_1C_2$. In the next frame, i, the actor interacts with a single object $C_1$, generating the kernel named $C_1$.
3.2 Anomalous Pattern Detection
Given the interaction structures, the following step is to detect anomalous patterns in the testing phase. At this stage, our model pursues two different strategies: (i) unrecognized interactions; and (ii) correct sequence of interactions. The goal of these strategies is to recognize anomaly types according to the learned context information.
3.2.1 Unrecognized Interactions
In our model, this strategy represents the first level of anomaly detection; we can see it as atomic anomaly detection. The idea is quite simple: during the training phase a list of interactions is built; then, if an interaction is not present in this list, the corresponding node is marked as anomalous. This strategy intends to recognize when an object, or a set of objects, is present in the test but was not present in the training phase. For instance, if nobody interacted with the fire extinguisher in a computer laboratory during the training phase, such an interaction occurring in the testing phase would represent an anomalous event. Figure 2 illustrates an unrecognized interaction, where the object U was not seen during the training phase and therefore is not in the dictionary, indicating an unrecognized interaction.

Figure 2: Example of unrecognized interaction. Frame i+1 illustrates the interaction with an unknown object.
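This first strategy then reduces to a membership test against the dictionary; a minimal sketch, repeating the hypothetical `kernel_label` helper from the previous sketch for completeness:

```python
def kernel_label(object_ids):
    """Unique label from the sorted object identifiers (as above)."""
    return "".join(sorted(object_ids))

def is_unrecognized(object_ids, dictionary):
    """Strategy 1: an interaction is anomalous if its kernel label
    was never observed during training."""
    return kernel_label(object_ids) not in dictionary

# A never-seen object 'U' yields a word outside the dictionary.
assert is_unrecognized({"U"}, {"C1", "C1C2", "C1C2C3"})
assert not is_unrecognized({"C2", "C1"}, {"C1", "C1C2", "C1C2C3"})
```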
3.2.2 Correct Sequence Interaction
Inspired by (Crispim et al., 2016), our second strategy explores statistical information about sequences of interactions. The idea is to detect events according to their occurrence probability, where low probabilities indicate anomalous events. Thus, our model collects interaction information from the kernel structures. Such information is saved in two hash tables, one used to count the number of occurrences of a specific word and the second used to count pairs of consecutive interactions. For instance, in Figure 1, we have two occurrences of the word $C_1$, and one occurrence each of the words $C_1C_2$ and $C_1C_2C_3$. At the same time, we have the following occurrence sequences: $\{(C_1C_2 \mid C_1), (C_1 \mid C_1C_2C_3), (C_1C_2C_3 \mid C_1)\}$.
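A minimal sketch of these two hash tables, using Python `Counter` objects as the hash tables and representing each tracklet as a list of kernel labels (an assumption for illustration):

```python
from collections import Counter

def count_interactions(tracklets):
    """Build the two hash tables from training tracklets: one counting
    single words (kernel labels) and one counting consecutive pairs."""
    word_counts = Counter()
    pair_counts = Counter()
    for labels in tracklets:  # each tracklet is a list of kernel labels
        word_counts.update(labels)
        pair_counts.update(zip(labels, labels[1:]))  # consecutive pairs
    return word_counts, pair_counts

# The tracklet of Figure 1: C1C2, C1, C1C2C3, C1
words, pairs = count_interactions([["C1C2", "C1", "C1C2C3", "C1"]])
assert words["C1"] == 2 and words["C1C2"] == 1
assert pairs[("C1C2", "C1")] == 1 and pairs[("C1", "C1C2C3")] == 1
```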
Widely applied in language modeling, N-grams rely on the assumption that the probability of a word depends only on the previous words; this is called the Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past (Jurafsky and Martin, 2009). Our model uses N-grams to compute the probability of a specific sequence using maximum likelihood, as in the following equation:

$$P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \approx \frac{count(w_{i-1}, w_i)}{count(w_{i-1})} \quad (1)$$
where each word $w_i$ is represented by a kernel label, i.e., the set of objects that interact with the person in a specific frame. This information is found in the hash tables built from the interaction sequences. Finally, if the probability $P(w_i \mid w_{i-1})$ is less than a threshold $\eta$ (0.6 in our experiments), then such interaction, and consequently the observed tracklet, is marked as anomalous.
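A sketch of the resulting anomaly test under Equation 1, assuming the `Counter`-based tables built in the previous sketch (missing keys count as zero); the helper names are hypothetical and $\eta = 0.6$ follows the text:

```python
from collections import Counter

def bigram_probability(prev_word, word, word_counts, pair_counts):
    """Maximum-likelihood estimate of Equation 1:
    P(w_i | w_{i-1}) ~= count(w_{i-1}, w_i) / count(w_{i-1})."""
    if word_counts[prev_word] == 0:
        return 0.0  # previous word never observed in training
    return pair_counts[(prev_word, word)] / word_counts[prev_word]

def is_anomalous_tracklet(labels, word_counts, pair_counts, eta=0.6):
    """Strategy 2: flag the tracklet if any pair of consecutive
    interactions has probability below the threshold eta."""
    return any(
        bigram_probability(prev, cur, word_counts, pair_counts) < eta
        for prev, cur in zip(labels, labels[1:])
    )

# As in Figure 3: 'C1' was seen 60 times but the pair (C1, C2) only
# once, so P = 1/60 < 0.6 and the tracklet is marked as anomalous.
word_counts = Counter({"C1": 60, "C2": 50})
pair_counts = Counter({("C1", "C2"): 1})
assert is_anomalous_tracklet(["C1", "C2"], word_counts, pair_counts)
```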
Figure 3 presents a brief example of an incorrect sequence of interactions. In this case, the anomaly is given by the low probability of a specific sequence occurring. In particular, the probability (Equation 1) in frame i is smaller than $\eta = 0.6$; therefore, we mark the entire tracklet as anomalous.
4 EXPERIMENTAL RESULTS
This section reports our results and is divided into two parts: (i) tests based on unrecognized interactions; and (ii) tests based on anomalous interaction sequences. We use the same set of videos for training in both parts.
Figure 3: Example of incorrect sequence interaction. The probability in frame i is 1/60, which would be considered anomalous for a threshold $\eta = 0.6$.
Dataset. Our dataset was built from different views captured in different laboratories. We created a set of training videos to represent common situations, including people interacting with chairs, laptops, backpacks and monitors. The dataset is divided into training and testing video sequences. The ground truth is composed of every situation that was not present in the training videos. It is important to highlight the diversity not only in camera view but also in environments: our dataset contains clips of different places and views for the same context, a computer laboratory. For instance, various video sequences are recorded with different camera positions in the room. (This novel dataset will be made publicly available after the acceptance of this paper.)
Our proposed dataset is composed of everyday laboratory work activities, for instance, students sitting and working at their computers, and people entering and passing in front of the camera. The duration of the videos does not exceed five minutes of recording. The dataset is composed of 11 clips for testing and 20 for training. As anomaly samples, we consider all interactions that a person performs with a coffee machine or a pillow (neither interaction took place during training). We also consider as anomalous the sequences where a person interacted with the camera before the backpack, since in the training videos people who interacted with the camera always did it with the backpack first.
Output and Settings for Test Model. In the first stage, our model detects actors and human poses. In this case, we use the default CNN settings and precomputed models provided by (Liu et al., 2015) and (Eichner et al., 2012). In the second stage, we have two variables: (i) the tracklet building variable $\delta_1 = 50$; and (ii) the Jaccard index $j = 0.7$. The tracklet building variable is a threshold value that bounds the distance between tracklets and people occurrences. Thus, for each frame, bounding boxes are linked to a specific tracklet if the distance is smaller than this value ($\delta_1$). The Jaccard index complements our tracking model since in some cases the bounding box of a certain person may vary with their pose; for instance, if a person stretches his/her arms, the related bounding box will grow and a distance threshold alone may give a wrong answer. In view of that, we use an occupation criterion based on overlapping areas considering the Jaccard index. To build the interaction representation for a particular tracklet, we consider a threshold $\delta_2 = 100$ to link objects with a person tracklet. Both $\delta_1$ and $\delta_2$ are measured in pixels.
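A simplified sketch of this tracklet-linking rule under the stated settings; the box format and helper names are our own illustrative assumptions, and the real system additionally relies on the Kalman-filter tracker:

```python
import math

def jaccard(a, b):
    """Jaccard index (intersection over union) of two boxes
    given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def belongs_to_tracklet(box, last_box, delta_1=50.0, j=0.7):
    """Link a detection to an existing tracklet when the distance between
    box centers is below delta_1 pixels, or when the boxes overlap enough
    (Jaccard index >= j), which handles boxes that grow with the pose."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    lx, ly = (last_box[0] + last_box[2]) / 2.0, (last_box[1] + last_box[3]) / 2.0
    close = math.hypot(cx - lx, cy - ly) < delta_1
    return close or jaccard(box, last_box) >= j
```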
During the test, the interaction structure is labeled as abnormal when an anomaly is detected by one of the proposed strategies. Finally, we join the anomalous events into anomaly intervals belonging to a given subject (tracklet). This allows us to determine which subject performed the anomaly and at what specific frame.
To evaluate the detection results, we use the metric proposed in (Cao et al., 2010). Ground-truth anomalies are denoted by $Q^g = \{Q^g_1, Q^g_2, \ldots, Q^g_m\}$ and the output results are denoted by $Q^d = \{Q^d_1, Q^d_2, \ldots, Q^d_n\}$. The function $HG(Q^g_i)$ denotes whether a ground-truth interval $Q^g_i$ is detected, and the function $TD(Q^d_j)$ denotes whether a detected interval $Q^d_j$ makes sense or not. $HG(Q^g_i)$ and $TD(Q^d_j)$ are judged by checking whether the Jaccard index is above a threshold $th$ (0.30 in our experiments), according to

$$HG(Q^g_i) = \begin{cases} 1, & \text{if } \exists\, Q^d_k \text{ s.t. } \frac{|Q^d_k \cap Q^g_i|}{|Q^g_i|} > th \\ 0, & \text{otherwise} \end{cases}$$

$$TD(Q^d_j) = \begin{cases} 1, & \text{if } \exists\, Q^g_k \text{ s.t. } \frac{|Q^g_k \cap Q^d_j|}{|Q^d_j|} > th \\ 0, & \text{otherwise} \end{cases}$$
Based on the previous equations, we then compute the precision and recall metrics, defined as

$$\text{Precision} = \frac{\sum_{j=1}^{n} TD(Q^d_j)}{n}, \qquad \text{Recall} = \frac{\sum_{i=1}^{m} HG(Q^g_i)}{m}.$$
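Under the temporal-interval interpretation of these formulas (representing intervals as (start, end) frame tuples is our assumption), the evaluation can be sketched as:

```python
def overlap(a, b):
    """Length of the intersection of two frame intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def precision_recall(gt, det, th=0.30):
    """Metric of (Cao et al., 2010): a ground-truth interval counts as
    detected (HG) when some detection covers more than th of it; a
    detection makes sense (TD) when it overlaps some ground-truth
    interval by more than th of its own length."""
    hg = sum(1 for g in gt if any(overlap(d, g) / (g[1] - g[0]) > th for d in det))
    td = sum(1 for d in det if any(overlap(g, d) / (d[1] - d[0]) > th for g in gt))
    precision = td / len(det) if det else 0.0
    recall = hg / len(gt) if gt else 0.0
    return precision, recall

# One ground-truth anomaly in frames 100-200 and one detection in
# 150-210: the detection covers half of the ground truth, so HG = TD = 1.
print(precision_recall([(100, 200)], [(150, 210)]))  # (1.0, 1.0)
```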
4.1 Unrecognized Interactions
These experiments are oriented to determine anomalies involving unrecognized objects. For instance, a pillow and a coffee maker are not usually found in a common computer laboratory and, although these objects are not dangerous or suspicious, we take into account our premise that anything that did not appear during training will be considered abnormal.
Table 1 presents the results of these experiments. Precision and recall values are presented as tuples (P/R). The tests related to the detection of anomalies due to unknown interactions presented positive results at the first level of detection. Moreover, this strategy was satisfactory in detecting anomalous sequences. However, tests 3 and 4 presented lower precision and recall values than expected for both methods.
The reason for the low accuracy in test 3 is that we do not have a sense of depth in the image analysis. In view of that, a person whose hands are far from an object in depth can still be considered to be interacting with it. The low recall in test 4 occurs because the pose estimator sometimes fails to detect the coordinates of the person's hands. Thereby, the distance between the hands of the actor and the objects exceeds the limit that defines an interaction, making it difficult to detect the anomalies.
Table 1: Precision and Recall (P/R) results for our dataset.

         Strategy 1   Strategy 2
Test 1   1/1          0/0
Test 2   1/0.5        1/0.5
Test 3   0.25/0       0.25/0.1
Test 4   1/0.11       1/0.07
The idea of this experiment is to show that our model may recognize anomalous situations given by a previously unknown interaction. One could suppose that such a binary problem is easy to solve when considering only objects that do not appear during training. However, it goes beyond unrecognized objects, since we are also looking for unknown interactions. Thus, during the training phase, interactions contain no more than three specific objects, while in the testing phase there are interactions combining these specific objects with another object that, although it appears in the training phase, was not seen in such a combination.
For instance, interactions with laptops, chairs, notebooks and backpacks are common in computer laboratories; nonetheless, some combinations, such as computer and coffee cup, may turn out abnormal, since by our premise whatever did not happen in training will be an anomaly in testing. This can make sense in practice: drinks may not be allowed in the laboratory area, while the environment has a specific coffee area. Another goal of this experiment is to determine interactions with unrecognized objects (e.g., abnormal objects; people interacting with weapons, for instance, may be considered an abnormal situation in a laboratory).
Figure 4 shows simple examples of our model when an anomaly is detected. Subjects and objects are marked by blue and green bounding boxes, respectively. The hand positions are marked as pink circles.

Figure 4: Sample outputs of the model. In these images, unrecognized objects (a pillow and a coffee maker machine) were not seen in the training phase. Blue bounding boxes correspond to persons and green ones to objects. Hands are marked with pink points.
4.2 Correct Sequence Interactions
Here, our main goal is to collect the temporal information about the relations in the interaction structures. In these experiments, the goal is to discover sequences that have not been seen before (e.g., to enter a bank office you first need to pass through a metal detector security door). In this stage, we create a synthetic situation where a video camera is placed in a particular location. Such an event happened several times in the training phase, and the test case occurs when the device is removed. The results for this experiment are presented in Table 2. As expected, N-grams (Strategy 2) achieved the best results in determining anomalous sequences.
Table 2: Precision and Recall (P/R) results of our dataset.

          Strategy 1   Strategy 2
Test 5    0/0          1/0.5
Test 6    0/0.5        0.5/0.5
Test 7    1/0.5        1/1
Test 8    0/0          1/0.33
Test 9    0/0          1/0.5
Test 10   0/0          1/1
Test 11   0/0          1/0.5
Figure 5 shows simple examples of our model when a sequence interaction anomaly is detected. Subjects and objects are marked by blue and green bounding boxes, respectively. The hand positions are marked as pink circles. In these sequences, the context is that nobody took the camera placed on top of the locker; however, during training, the student had placed the camera in this position.

Figure 5: Example of unknown sequences of interactions. The student removes the camera, an action that was not present in the training phase.
4.3 Discussion
In this section, we discuss important aspects regard-
ing the proposed method.
Dependency on the low level tasks: It is important to highlight that our goal is to introduce a new model to recognize and detect anomalies; instead of using only a specific environment, our proposed model attempts to learn context based on human-object interactions. We deal with some artifacts and mistakes due to the object detection, tracking and pose recognition approaches, which could be overcome by employing better approaches to generate tracklets.
Human-object interaction recognition: According to the experimental results, it is clear that using only the Euclidean distance to link humans and objects is not the best possible strategy. One possible improvement would be to use depth information to improve the link between humans and objects. However, we did not include depth information since surveillance systems generally use a single camera per environment (making it impossible to estimate depth).
Testing with other datasets: In the literature, we find several well-known datasets for the anomalous event detection problem. However, such datasets are quite specific to crowd scenes with low resolution. Since our model is based on human-object interactions, and in most of the aforementioned datasets this type of relation is not discernible given the quality of the video sequences, they are not suitable for our evaluation.
Comparisons with other approaches: Just as our model would not achieve representative results on other datasets, algorithms from the literature would not work properly on our proposed dataset due to the type of representations they learn (characteristics such as speed, orientation, trajectories, appearance and textures). These types of features change significantly from scene to scene and are not representative when the environment changes.
Experimental Validation: In our experiments, we present artificial situations that could serve as examples of many others. Since anomalies can be seen as particular situations for a certain environment, our experiments focused on presenting the essence of our model.
5 CONCLUSIONS
In this paper, we proposed an approach for anomaly detection and localization based on contextual information. Instead of using common information, such as texture, magnitude or orientation, we proposed a model based on human-object interactions. An important contribution of this study is the different perspective on information collection and anomaly representation. Our model was capable of detecting anomalies and of determining which individual performed an anomalous activity. As future work, we intend to focus on addressing the weaknesses of our model discussed in the experiments section, such as the human-object interaction recognition in video sequences.
ACKNOWLEDGMENTS
The authors would like to thank the Brazilian National Research Council – CNPq (Grant #311053/2016-5),
the Minas Gerais Research Foundation FAPEMIG
(Grants APQ-00567-14 and PPM-00540-17) and the
Coordination for the Improvement of Higher Educa-
tion Personnel – CAPES (DeepEyes Project).
REFERENCES
Bera, A., Kim, S., and Manocha, D. (2016). Realtime
anomaly detection using trajectory-level crowd behav-
ior learning. In CVPRW.
Cao, L., Tian, Y., Liu, Z., Yao, B., Zhang, Z., and Huang, T. S. (2010). Action detection using multiple spatial-temporal interest point features. In IEEE International Conference on Multimedia and Expo (ICME).
Cheng, K.-W., Chen, Y.-T., and Fang, W.-H. (2015). Video
anomaly detection and localization using hierarchi-
cal feature representation and gaussian process regres-
sion. In CVPR.
Colque, R. V. H. M., Junior, C. A. C., and Schwartz, W. R.
(2015). Histograms of optical flow orientation and
magnitude to detect anomalous events in videos. In
SIBGRAPI.
Crispim, C. F., Koperski, M., Cosar, S., and Bremond, F.
(2016). Semi-supervised understanding of complex
activities from temporal concepts. In AVSS.
Doretto, G., Chiuso, A., Wu, Y. N., and Soatto, S. (2003).
Dynamic textures. International Journal of Computer
Vision.
Eichner, M., Marin-Jimenez, M., Zisserman, A., and Fer-
rari, V. (2012). 2D articulated human pose estimation
and retrieval in (almost) unconstrained still images.
International Journal of Computer Vision.
Fang, Z., Fei, F., Fang, Y., Lee, C., Xiong, N., Shu, L.,
and Chen, S. (2016). Abnormal event detection in
crowded scenes based on deep learning. Multimedia
Tools and Applications.
Feng, Y., Yuan, Y., and Lu, X. (2016). Deep Representation
for Abnormal Event Detection in Crowded Scenes.
Proceedings of the 2016 ACM on Multimedia Confer-
ence.
Hasan, M., Choi, J., Neumann, J., Roy-Chowdhury, A. K.,
and Davis, L. S. (2016). Learning Temporal Regular-
ity in Video Sequences. In CVPR.
Javan Roshtkhari, M. and Levine, M. D. (2013). An on-line,
real-time learning method for detecting anomalies in
videos using spatio-temporal compositions. Comput.
Vis. Image Underst.
Jurafsky, D. and Martin, J. H. (2009). Speech and Language Processing (2nd Edition). Prentice-Hall, Inc.
Leyva, R., Sanchez, V., and Li, C. T. (2017). Video anomaly
detection with compact feature sets for online perfor-
mance. IEEE Transactions on Image Processing.
Li, F., Yang, W., and Liao, Q. (2016). An efficient anomaly
detection approach in surveillance video based on ori-
ented GMM. In ICASSP.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.,
Fu, C.-Y., and Berg, A. C. (2015). SSD: Single shot
multibox detector. arXiv preprint arXiv:1512.02325.
Loy, C. C. (2010). Activity Understanding and Unusual Event Detection in Surveillance Videos. PhD thesis, Queen Mary University of London.
Mora-Colque, R. V. H., Caetano, C., de Andrade, M. T. L.,
and Schwartz, W. R. (2017). Histograms of optical
flow orientation and magnitude and entropy to detect
anomalous events in videos. IEEE Trans. Circuits
Syst. Video Techn.
Ribeiro, M., Lazzaretti, A. E., and Lopes, H. S. (2017).
A study of deep convolutional auto-encoders for
anomaly detection in videos. Pattern Recognition Let-
ters.
Sabokrou, M., Fayyaz, M., Fathy, M., and Klette, R. (2017).
Deep-cascade: Cascading 3d deep neural networks for
fast anomaly detection and localization in crowded
scenes. IEEE Transactions on Image Processing,
pages 1992–2004.
Wang, J. and Xu, Z. (2016). Spatio-temporal texture mod-
elling for real-time crowd anomaly detection. Com-
puter Vision and Image Understanding.
Xu, D., Ricci, E., Yan, Y., Song, J., and Sebe, N. (2015).
Learning deep representations of appearance and mo-
tion for anomalous event detection. In British Ma-
chine Vision Conference (BMVC).
Xu, D., Ricci, E., Yan, Y., Song, J., and Sebe, N. (2017).
Detecting anomalous events in videos by learning
deep representations of appearance and motion. Com-
put. Vis. Image Underst.
Xu, D., Wu, X., Song, D., Li, N., and Chen, Y.-L. (2013).
Hierarchical activity discovery within spatio-temporal
context for video anomaly detection. In International
Conference on Image Processing (ICIP).
Yu, S. D. X. W. X. (2014). Crowded abnormal detec-
tion based on mixture of kernel dynamic texture. In
ICALIP.
Zhou, S., Shen, W., Zeng, D., Fang, M., Wei, Y., and
Zhang, Z. (2016). Spatial-temporal convolutional neu-
ral networks for anomaly detection and localization in
crowded scenes. Signal Processing: Image Commu-
nication.