Table 7: Peliculas Results.

Model           Accuracy (%)    Inference Time (ms)
Benchmark       97.7            552
EMI-LSTM        99.2            412
EMI-GRU         99.0            413
EMI-FastGRNN    93.1            413
Table 8: Test Conditions.

Model                   Benchmark     Ours
Processor               Intel Xeon    ARM v8
Number of cores         12            4
Processor Speed (GHz)   2.9           1.4
EMI-LSTM and EMI-GRU perform very similarly, with EMI-LSTM achieving a slightly higher accuracy.
Table 8 compares the test conditions of the benchmark and of our model. Our solution is implemented with significantly fewer computational resources than the benchmark.
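For context, the per-sample inference times in Table 7 can be obtained with a simple wall-clock measurement such as the sketch below; the model object, dummy input and number of repetitions are illustrative assumptions, not the exact benchmarking script used in this work.

    import time

    def mean_inference_ms(predict_fn, sample, repeats=100):
        """Average wall-clock latency of predict_fn(sample) in milliseconds."""
        predict_fn(sample)                      # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(repeats):
            predict_fn(sample)
        return (time.perf_counter() - start) * 1000.0 / repeats

    # Hypothetical usage (model and dummy_input are placeholders, not this paper's setup):
    # latency_ms = mean_inference_ms(model.predict, dummy_input)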
8 CONCLUSION
Multiple Instance Learning and Early Stopping (EMI) concepts were applied to two real-world crime detection datasets, and feature extraction was made faster by sampling videos at a sub-Nyquist rate. The proposed models surpass existing benchmarks and have proven deployable on resource-constrained devices connected to surveillance cameras. On the UCF-Crime dataset, we achieved a maximum accuracy of 91.3%, an inference time of 1.3 s, and a minimum model size of 0.334 MB. On the Peliculas dataset, we achieved an accuracy of 99.2% and an inference time of 412 ms.
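As an illustration of the sub-Nyquist sampling mentioned above, the sketch below keeps only every k-th frame of a clip before features are computed; the OpenCV-based loader, the skip factor of 5 and the frame size are illustrative assumptions rather than the exact extraction pipeline of this work.

    import cv2
    import numpy as np

    def sample_frames(video_path, skip=5, size=(224, 224)):
        """Keep every `skip`-th frame, i.e. sample the clip below its native frame rate."""
        cap = cv2.VideoCapture(video_path)
        frames, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % skip == 0:
                frames.append(cv2.resize(frame, size))
            index += 1
        cap.release()
        # Returned array has shape (num_sampled_frames, height, width, 3).
        return np.stack(frames) if frames else np.empty((0, size[1], size[0], 3), dtype=np.uint8)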
ACKNOWLEDGEMENTS
The authors would like to acknowledge Solarillion
Foundation for its support and resources provided for
the research work carried out.