PRE-FIGHT DETECTION

Classification of Fighting Situations using Hierarchical AdaBoost

Scott J. Blunsden

School of Computing, University of Dundee, Dundee, U.K.

Robert B. Fisher

IPAB, School of Informatics, University of Edinburgh, Edinburgh, U.K.

Keywords: Fight, Pre-fight, Cuboid, AdaBoost.

Abstract: This paper investigates the detection and classification of fighting, pre-fight and post-fight events as viewed from a video camera. Specifically we investigate normal, pre-fight, post-fight and actual fighting sequences and classify them. A hierarchical AdaBoost classifier is described and results using this approach are presented. We show that it is possible to classify pre-fighting situations using such an approach and demonstrate how it can be used in the general case of continuous sequences.

1 INTRODUCTION

This paper investigates pre-fighting situations as viewed from a video camera within a surveillance domain. The main aim is to establish the feasibility of detecting fighting situations. We are also interested in detecting and classifying pre-fight and post-fight situations. Detecting pre-fight behaviour is useful in surveillance, where the timely intervention of a CCTV operator could avoid a potentially criminal situation and thus prevent an escalation of violence.

Within this paper we make use of the spatio-temporal features of Dollar et al. (Dollar et al., 2005) to construct a sequence representation. A hierarchical version of AdaBoost is then used to classify the sequences. We demonstrate a classifier which gives a classification performance of 96% when classifying fighting vs non-fighting situations on the BEHAVE dataset.

2 PREVIOUS WORK

Human ability to predict dangerous or criminal activities from CCTV has previously been investigated by Troscianko et al. (Troscianko et al., 2004). In their work participants from either an expert or a non-expert group were shown videos from CCTV cameras. At a particular point in time the video was paused and the participants were asked to predict, on a scale of 1 to 5, whether they thought a dangerous act would be committed by an individual or individuals in the video. Participants classified 80% of criminal incidents correctly, and 65% of normal but similar incidents. Dee and Hogg (Dee and Hogg, 2004) also compared human performance with a computational model and found correlations in the rating of 'interestingness'.

The most similar work to ours is that of Datta et al. (Datta et al., 2002), who detect person-on-person violence using a range of measures derived from a background-removed and segmented representation of the person. The measures include acceleration and jerk along with the leg and arm orientations. All are computed from a side-on point of view, and results indicate good performance on their dataset of 62 situations, with a correct classification rate of 97%.

Cupillard et al. (Cupillard et al., 2002) also investigate fighting situations within the domain of Metro surveillance. They use pre-defined templates of activity to match the on-screen activity and classify the image sequence.

Ribeiro et al. (Ribeiro and Santos-Victor, 2005) also attempt to classify what a person is doing within the CAVIAR dataset using a hierarchical feature selection method. Others such as Davis and Bobick (Davis and Bobick, 2001) used moments based upon a stabilised silhouette image to classify more general motion. Efros et al. (Efros et al., 2003) used an optical flow based similarity measure to match different persons' actions.

Blunsden, S. J. and Fisher, R. B.

PRE-FIGHT DETECTION - Classification of Fighting Situations using Hierarchical AdaBoost.

DOI: 10.5220/0001775903030308

In Proceedings of the Fourth International Conference on Computer Vision Theory and Applications (VISIGRAPP 2009), page

ISBN: 978-989-8111-69-2

3 FEATURES

We make use of Dollar et al.'s (Dollar et al., 2005) approach to sequence representation as it has previously been successful (Niebles et al., 2006; Dollar et al., 2005) and can deal with occlusions. Background subtraction was not used as it gave inconsistent results on these sequences. The method is briefly reviewed here.

Dollar et al. (Dollar et al., 2005) developed a

spatio-temporal response function for classifying se-

quences of behaviours. Their approach assumes a sta-

tionary camera (or that the effects of camera motion

can be compensated for). The response function is

given in equation (1).

$$R = (I \otimes g \otimes h_{ev})^2 + (I \otimes g \otimes h_{od})^2 \qquad (1)$$

The 2D smoothing Gaussian function $g(x, y, \sigma)$ is applied only along the spatial dimensions of the image sequence $I$. The two functions $h_{ev}$ and $h_{od}$ are a pair of Gabor filters which are defined as $h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^2/\tau^2}$ and $h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^2/\tau^2}$. They are applied along the temporal dimension of the image sequence. Throughout all experiments we set $\omega = 4/\tau$. This gives the response function two parameters corresponding to the spatial scale ($\sigma$) and the temporal scale ($\tau$). They were set to ($\tau = 3$, $\sigma = 3$) throughout all experiments. This follows on from work by Dollar et al. (Dollar et al., 2005) and separately Niebles et al. (Niebles et al., 2006), who found that the 3 × 3 × 3 spatial and temporal resolution was sufficient for action recognition. Only those responses above a threshold value are recorded.

Figure 1: The scaled original image (top row) along with

the corresponding response image (R - equation 1) bottom

row.
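The temporal part of the response function in equation (1) can be illustrated with a minimal sketch. It is not the authors' implementation: it treats a single pixel's intensity over time, assumes the spatial Gaussian smoothing has already been applied, and truncates the Gabor kernels at ±2τ frames. The function names (`gabor_pair`, `convolve_valid`, `temporal_response`) and the truncation width are assumptions made for illustration.

```python
import math

def gabor_pair(tau, omega=None, half_width=None):
    """Return (h_ev, h_od) sampled at integer t in [-half_width, half_width]."""
    if omega is None:
        omega = 4.0 / tau                    # setting quoted in the text
    if half_width is None:
        half_width = math.ceil(2 * tau)      # assumed truncation of the kernel support
    ts = range(-half_width, half_width + 1)
    h_ev = [-math.cos(2 * math.pi * t * omega) * math.exp(-t * t / tau ** 2) for t in ts]
    h_od = [-math.sin(2 * math.pi * t * omega) * math.exp(-t * t / tau ** 2) for t in ts]
    return h_ev, h_od

def convolve_valid(signal, kernel):
    """1D convolution, keeping only positions where the kernel fully overlaps."""
    k = len(kernel)
    flipped = kernel[::-1]
    return [sum(signal[i + j] * flipped[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def temporal_response(signal, tau=3.0):
    """R = (s * h_ev)^2 + (s * h_od)^2 for one (already smoothed) pixel over time."""
    h_ev, h_od = gabor_pair(tau)
    ev = convolve_valid(signal, h_ev)
    od = convolve_valid(signal, h_od)
    return [a * a + b * b for a, b in zip(ev, od)]

# A static pixel versus one that changes periodically over time.
constant = [1.0] * 40
flicker = [1.0 if t % 3 == 0 else 0.0 for t in range(40)]
print(max(temporal_response(constant)), max(temporal_response(flicker)))
```

In this sketch a pixel whose intensity never changes produces a response close to zero, while a pixel with periodic temporal change produces a much larger one, which is the behaviour the thresholding step relies on.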

Figure 2: Examples of sequences along with the corresponding histogram representation. Top is a fighting sequence, bottom left is a normal sequence and bottom right is a post fighting sequence. The fixed size histograms are computed over the complete sequence, whose length can vary. The histograms are normalised to unit weight.

From these response functions a cuboid descriptor is formed. This is a three dimensional cuboid formed from the original image sequence in space and time. It consists of all (greyscale) pixel values within an area of six times the scale at which it was detected. Only those regions where the response is above a certain threshold are used.
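The cuboid extraction described in section 3 can be sketched as follows. This is an illustrative reading of the text rather than the original cuboids code: the video and response are plain nested lists indexed as `[t][y][x]`, the cuboid is assumed to be centred on the detected point, the scales are assumed to be integers, and points too close to the border are simply skipped. The names `extract_cuboid` and `detect_cuboids` are hypothetical.

```python
def extract_cuboid(video, t, y, x, sigma=3, tau=3):
    """Flatten the greyscale block centred on (t, y, x); each side is six times the scale."""
    ht, hs = 3 * tau, 3 * sigma            # half-extents, so the full side is 6 x scale
    n_t, n_y, n_x = len(video), len(video[0]), len(video[0][0])
    inside = ht <= t < n_t - ht and hs <= y < n_y - hs and hs <= x < n_x - hs
    if not inside:
        return None                         # assumed policy: skip points near the border
    return [video[tt][yy][xx]
            for tt in range(t - ht, t + ht)
            for yy in range(y - hs, y + hs)
            for xx in range(x - hs, x + hs)]

def detect_cuboids(video, response, threshold, sigma=3, tau=3):
    """Keep a cuboid at every location whose response exceeds the threshold."""
    cuboids = []
    for t, frame in enumerate(response):
        for y, row in enumerate(frame):
            for x, value in enumerate(row):
                if value > threshold:
                    cuboid = extract_cuboid(video, t, y, x, sigma, tau)
                    if cuboid is not None:
                        cuboids.append(cuboid)
    return cuboids
```

With the scales used in the paper (σ = 3, τ = 3) each cuboid in this sketch would hold 18 × 18 × 18 pixel values.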

3.1 Sequence Representation

Each sequence generates a set of cuboids (as detailed in section 3). Each pre-identified class (fighting, pre-fight, post-fight and normal) generates a large number of cuboids over all sequences. From this large number of cuboids a smaller set is randomly sub-sampled so that the cuboids can be clustered. K-means clustering (Duda et al., 2000) is used to identify k cluster centres (using the Euclidean distance as the similarity metric). Clustering was performed per class (k = 10, set empirically, giving a total dictionary size of 40), with the final dictionary consisting of all cluster centres concatenated.
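A minimal sketch of the per-class dictionary construction is given below. It uses a plain Lloyd's k-means written with the standard library; the sample size, iteration count, seeding and the function names (`kmeans`, `build_dictionary`) are assumptions for illustration and are not taken from the paper.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm with Euclidean distance; returns k cluster centres."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centres[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:                        # keep the old centre if a cluster is empty
                centres[j] = [sum(col) / len(group) for col in zip(*group)]
    return centres

def build_dictionary(cuboids_by_class, k=10, sample_size=1000, seed=0):
    """Cluster a random subsample of each class separately, then concatenate the centres."""
    rng = random.Random(seed)
    dictionary = []
    for label in sorted(cuboids_by_class):
        cuboids = cuboids_by_class[label]
        sample = rng.sample(cuboids, min(sample_size, len(cuboids)))
        dictionary.extend(kmeans(sample, k, seed=seed))
    return dictionary
```

With four classes and k = 10 this yields the 40-entry dictionary described in the text.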

For each sequence a histogram is created based

upon the previously learned cluster centres. The re-

sponse function (equation 1) is applied throughout the

complete sequence. Cuboids are then generated from

the complete sequence as described in section 3.

For each cuboid in the new sequence the near-

est cluster within the learned cluster centre dictionary

is found. A histogram is then made of all matches

throughout the sequence. This histogram is then nor-

malised. Examples of image sequences and their cor-

responding histograms are given in ﬁgure 2.
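The nearest-centre matching and histogram normalisation can be sketched as below. The normalisation to unit sum is an assumed reading of "unit weight", and the function names are hypothetical.

```python
def nearest_index(cuboid, dictionary):
    """Index of the closest cluster centre under squared Euclidean distance."""
    return min(range(len(dictionary)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(cuboid, dictionary[j])))

def sequence_histogram(cuboids, dictionary):
    """Count nearest-centre matches over the sequence and normalise to unit sum."""
    counts = [0.0] * len(dictionary)
    for cuboid in cuboids:
        counts[nearest_index(cuboid, dictionary)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Toy example with a three-entry dictionary of 2D "cuboids".
dictionary = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
cuboids = [[0.1, 0.2], [9.0, 9.5], [0.3, -0.1], [1.0, 9.0]]
print(sequence_histogram(cuboids, dictionary))  # [0.5, 0.25, 0.25]
```

Because the histogram length equals the dictionary size, sequences of different lengths map to vectors of the same dimension.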

In addition to the histogram the features, as described in section 3, are included in the sequence representation. This gives a final representation ($S_i$) of sequence $i$:

$$S_i = [\, H_i \mid d_i \mid R^{\mu}_i \mid R^{\sigma}_i \,] \qquad (2)$$

$H_i$ is the $i$th histogram and $d_i$ the distance the person has moved for the $i$th sequence. The mean ($R^{\mu}_i$) and standard deviation ($R^{\sigma}_i$) of the response image sequence ($R$) for the $i$th sequence make up the other additional features.
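As a small illustration, the sequence representation can be assembled by concatenating the histogram with the three scalar features. The ordering of the entries, the use of the population standard deviation and the function name are assumptions; the paper does not specify them.

```python
import statistics

def sequence_representation(histogram, distance_moved, response_values):
    """Concatenate the histogram with distance moved and response mean / std."""
    mean_r = statistics.mean(response_values)
    std_r = statistics.pstdev(response_values)   # population std (assumed choice)
    return list(histogram) + [distance_moved, mean_r, std_r]

# A 2-bin histogram, 12 units of movement, and two response values.
print(sequence_representation([0.5, 0.5], 12.0, [1.0, 3.0]))
# [0.5, 0.5, 12.0, 2.0, 1.0]
```

With the 40-entry dictionary described earlier, this would give a 43-dimensional vector per sequence.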

4 CLASSIFICATION

Here AdaBoost (Freund and Schapire, 1996) is used to classify each sequence based upon the sequence representation. The implementation of AdaBoost uses a decision tree classifier as a weak learner.

The approach differs from a standard AdaBoost classifier in that we employ a hierarchical classification method. Such a hierarchy is preferable to multiclass AdaBoost (such as that used by Zhu et al. (Zhu et al., 2006)) as we are trying to discover the structure of events.
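For reference, a binary AdaBoost classifier of the kind used at each node of the hierarchy can be sketched as follows. This is the standard discrete AdaBoost update, but with depth-one decision stumps standing in for the decision tree weak learner mentioned in the text; that substitution, the number of rounds and all function names are assumptions for illustration.

```python
import math

def train_stump(X, y, w):
    """Best single-feature threshold rule under sample weights w; labels are +1 / -1."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for sign in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (sign if xi[f] > thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best

def stump_predict(stump, x):
    """Predict +1 / -1 for one sample with a (err, feature, threshold, sign) stump."""
    _, f, thr, sign = stump
    return sign if x[f] > thr else -sign

def adaboost_train(X, y, rounds=10):
    """Discrete AdaBoost: reweight samples so later stumps focus on earlier mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, x):
    """Sign of the weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

A single stump cannot separate an interval from its surroundings, but a weighted vote of a few stumps can, which is why boosting is useful even with very weak learners.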

4.1 Hierarchy

To discover the best structure (in terms of classification performance) the set P of possible hierarchical partitions of the classes was created. At each level within the hierarchy we look at all possible binary partitions of the class labels. The number of possible partitions at a particular node is given by equations (3) and (4):

$$f(N, k) = \begin{cases} \binom{N}{k} / 2 & \text{if } k = N/2 \\ \binom{N}{k} & \text{otherwise} \end{cases} \qquad (3)$$

$$\sum_{k=1}^{N/2} f(N, k) \qquad (4)$$

N is the number of possible classes (in our case four). The case where $k = N/2$ removes mirror partitions (i.e. partitions which are the same but simply swapped between the right and left side) from the set of partitions. For this four class problem there are $f(4,1) + f(4,2) = 4 + 3 = 7$ initial possible partitions: ([1][2 3 4]), ([1 2][3 4]), ([1 3][2 4]), ([1 4][2 3]), ([2][1 3 4]), ([3][1 2 4]), ([4][1 2 3]). Another partition is calculated at every node of the tree from the classes assigned to that node until each node has only one class.
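The count of seven partitions for four classes can be checked by direct enumeration. The sketch below assumes the upper summation limit is taken as the integer part of N/2 (which matters only for odd N); the function names are hypothetical.

```python
from itertools import combinations
from math import comb

def f(n, k):
    """Number of distinct splits with k classes on one side (equation 3)."""
    return comb(n, k) // 2 if 2 * k == n else comb(n, k)

def count_partitions(n):
    """Total number of binary partitions of n classes (equation 4)."""
    return sum(f(n, k) for k in range(1, n // 2 + 1))

def binary_partitions(labels):
    """Enumerate unordered two-way splits of the labels, skipping mirror images."""
    labels = list(labels)
    first, n = labels[0], len(labels)
    result = []
    for k in range(1, n // 2 + 1):
        for left in combinations(labels, k):
            if 2 * k == n and first not in left:
                continue                     # mirror of a split already produced
            right = tuple(c for c in labels if c not in left)
            result.append((left, right))
    return result

print(count_partitions(4))                   # 7
for left, right in binary_partitions([1, 2, 3, 4]):
    print(left, right)
```

The enumeration produces the same seven splits listed in the text, although in a different order.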

The hierarchical model starts with the set P of all possible partitions of the set of class labels ($C_n$) at the current node $n$. Each of these partitions ($p_i$) has a left ($l_i$) and right ($r_i$) branch such that:

$$p_i = \{l_i, r_i\} \qquad (5)$$

$$l_i, r_i \subset C_n \qquad (6)$$

$$l_i \cup r_i = C_n \qquad (7)$$
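One way to turn these partitions into a classification tree is to choose the best-scoring split at each node and recurse until every leaf holds a single class. The sketch below is only an assumed, greedy reading of that procedure: the `score` callable is hypothetical and would in practice be something like the held-out accuracy of a binary AdaBoost classifier trained to separate the left classes from the right classes.

```python
from itertools import combinations

def splits(classes):
    """All unordered two-way splits (left, right) covering the classes, both non-empty."""
    classes = sorted(classes)
    anchor, rest = classes[0], classes[1:]
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = frozenset((anchor,) + extra)
            yield left, frozenset(classes) - left

def build_tree(classes, score):
    """Pick the best-scoring split at each node until every leaf holds one class."""
    classes = frozenset(classes)
    if len(classes) == 1:
        return next(iter(classes))
    left, right = max(splits(classes), key=lambda p: score(*p))
    return (build_tree(left, score), build_tree(right, score))

def toy_score(left, right):
    """Hypothetical stand-in for classification performance of a split."""
    if left == {"normal"} or right == {"normal"}:
        return 1.0
    if left == {"fight"} or right == {"fight"}:
        return 0.5
    return 0.0

tree = build_tree({"fight", "pre-fight", "post-fight", "normal"}, toy_score)
print(tree)  # (('fight', ('post-fight', 'pre-fight')), 'normal')
```

Fixing one anchor class on the left side is what removes mirror-image splits, so four classes again give seven candidate splits at the root.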

5 RESULTS

5.1 Classification of Complete Sequences

These experiments are similar in spirit to those of

Troscianko et al. (Troscianko et al., 2004) who tested

human ability to detect dangerous situations by us-

ing complete pre-segmented sequences prior to ask-

ing the question: what happens next? Here the com-

plete test sequences of varying lengths are used to

test the algorithm’s performance. First the ques-

tion of optimal dictionary size is investigated. The

best performing dictionary size is then used to clas-

sify whole sequences and results are presented and

discussed. We use two publicly available datasets

to test the method. First we use the small scale

CAVIAR dataset (Project/IST 2001 37540, 2004) be-

fore also demonstrating the approach upon the BE-

HAVE dataset (Blunsden et al., 2007).

The datasets were manually labelled into four classes: pre-fight, post-fight, fighting and no fighting. The labelling was carried out by members of a computer vision surveillance lab.

Results on the BEHAVE Dataset. The classiﬁca-

tion tree was constructed by ﬁrst separating the train-

ing and test data into two distinct and equal sized sets.

The data was separated per sequence so that training

samples were not taken from the same sequence as

those used for testing. The best tree as determined by

our method over a number of runs is given in ﬁgure 3.

Confusion matrices for this tree are given in table 1.

Figure 3: The ﬁnal classiﬁcation tree. Shaded nodes show

the classes from which partitions of the data are formed.

This tree gives an overall classification performance of 89.9% correct classification with a standard deviation over multiple runs of 0.019. The confusion matrices for classifying individual classes and for treating all fighting behaviour as one class are given in table 1. For normal vs fighting behaviour correct classification is at 96%. The structure groups post and pre-fight behaviour together, suggesting that there is a high degree of similarity between them.


Table 1: Confusion matrices for classification of sequences on the BEHAVE dataset. Columns give the true class and rows the assigned class. (a) shows the performance treating each class individually, whilst (b) shows results with all fighting behaviour aggregated.

(a)
                        True
Classified      Fight   Pre-Fight   Post-Fight   Normal
Fight           0.96    0.02        0.06         0.02
Pre-Fight       0.04    0.88        0.08         0.01
Post-Fight      0       0.08        0.78         0.01
Normal          0       0.02        0.08         0.96

(b)
                      True
Classified            Fighting Related   Normal
Fighting Related      0.96               0.04
Normal                0.04               0.96

When grouping all fighting based behaviour together the performance increases substantially. It is useful to show performance for such normal vs non-normal behaviour as there are many applications to surveillance situations. The proportion of fighting situations classified as normal is relatively low, with much of the confusion arising between pre, post and actual fighting.
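The effect of aggregation can be illustrated with a short sketch that builds a column-normalised confusion matrix from label lists and then repeats the calculation after mapping the three fight-related labels onto one class. The label strings and function names are assumptions, and the toy labels are invented purely to show the mechanics; they are not results from either dataset.

```python
def confusion_matrix(true_labels, predicted, classes):
    """Column-normalised matrix: entry [p][t] is the fraction of true class t classified as p."""
    counts = {p: {t: 0 for t in classes} for p in classes}
    for t, p in zip(true_labels, predicted):
        counts[p][t] += 1
    totals = {t: sum(counts[p][t] for p in classes) for t in classes}
    return {p: {t: (counts[p][t] / totals[t] if totals[t] else 0.0) for t in classes}
            for p in classes}

def aggregate(labels, fighting=("fight", "pre-fight", "post-fight")):
    """Map the three fight-related labels onto a single class."""
    return ["fighting-related" if label in fighting else "normal" for label in labels]

# Invented toy labels, for illustration only.
true = ["fight", "pre-fight", "post-fight", "normal", "normal"]
pred = ["pre-fight", "pre-fight", "normal", "normal", "fight"]

per_class = confusion_matrix(true, pred, ["fight", "pre-fight", "post-fight", "normal"])
grouped = confusion_matrix(aggregate(true), aggregate(pred), ["fighting-related", "normal"])
print(per_class["fight"]["fight"], grouped["fighting-related"]["fighting-related"])
```

In the toy data the fight sequence mislabelled as pre-fight counts as an error in the per-class matrix but as correct once the fight-related classes are merged, which is the mechanism behind the improvement reported for the aggregated case.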

Results on the CAVIAR Dataset. For the smaller dataset the results are also promising (see table 2). However it should be noted that the number of fighting examples is significantly smaller than in the BEHAVE dataset. Again when grouping all the fighting situations together (pre, post and actual fighting) the results improve significantly. None of the fighting situations are confused with a normal situation. Overall performance is 89.3%, again with a very small standard deviation of 0.1. The overall accuracy rises to 92.9% when considering all fighting vs no fighting situations. The tree retains the same structure as the one above and so is not reproduced here.

However some normal situations are misclassified as fight situations. This is perhaps because some of the fighting scenes are acted out rather than being actual fights. Some of the scenes where people are walking together and meeting one another can look similar to fighting scenes within this dataset. It is often the pre and post fight behaviour which also helps to identify a fight, something which the normal sequences do not display.

5.2 Labeling of Continuous Sequences

A further experiment was conducted whereby sequences were not pre-segmented but instead a continuous video stream was presented to the classifier. This task is much harder than using pre-segmented sequences due to the high degree of overlap between different classes as they transition from one to the other.

Table 2: Confusion matrices for classification of sequences for the CAVIAR dataset. Columns give the true class and rows the assigned class. (a) shows the performance treating each class individually, whilst (b) shows results with all fighting behaviour aggregated.

(a)
                        True
Classified      Fight   Pre-Fight   Post-Fight   Normal
Fight           1       0           0            0.11
Pre-Fight       0       1           0.2          0
Post-Fight      0       0           0.8          0
Normal          0       0           0            0.89

(b)
                      True
Classified            Fighting Related   Normal
Fighting Related      1                  0.11
Normal                0                  0.89

In order to continuously classify each frame a window around the current frame was used to provide the features which the classifier used. This approach is shown in figure 4. A window centred on the current frame is used to reduce lag when the activity changes. Whole sequences were again divided into training and testing sets, and only the results of classifying the test set are presented. By dividing up complete sequences rather than individual frames we ensure we are classifying data rather than interpolating it.

Figure 4: Construction of the histograms for continuously labeling all frames in the video. The current frame's (highlighted) histogram is made up of cuboid centres from within a specified window (in this case 50 frames either side).

Every other step of the algorithm stayed the same, except that the features are derived from a finite window around the current frame. This gave a vast increase in the number of samples. For the BEHAVE dataset there are 31094 samples of size ±50 frames to classify (vs 1138 complete sequences, as in the previous section). The CAVIAR dataset gives 3094 individual frames to classify (vs 56 complete sequences). First we investigate what window size is appropriate to use.
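The per-frame histogram construction can be sketched as follows. Each cuboid is assumed to be summarised by the frame index of its centre and the index of its nearest dictionary entry; the argument names and the function name are hypothetical.

```python
def windowed_histograms(cuboid_frames, cuboid_words, n_frames, n_words, half_window=50):
    """One normalised histogram per frame, built from cuboids within +/- half_window frames."""
    histograms = []
    for frame in range(n_frames):
        counts = [0.0] * n_words
        for f, w in zip(cuboid_frames, cuboid_words):
            if abs(f - frame) <= half_window:
                counts[w] += 1.0
        total = sum(counts)
        # Frames with no cuboids in range keep an all-zero histogram.
        histograms.append([c / total for c in counts] if total else counts)
    return histograms

# Three cuboids centred at frames 0, 5 and 10, matched to dictionary entries 0, 1 and 1.
hists = windowed_histograms([0, 5, 10], [0, 1, 1], n_frames=11, n_words=2, half_window=2)
print(hists[0], hists[5])  # [1.0, 0.0] [0.0, 1.0]
```

This simple version rescans every cuboid for every frame; a running count updated as the window slides would avoid that cost on long videos.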


5.2.1 Classiﬁcation Results

BEHAVE Dataset. The best result when using this method on the BEHAVE sequences gave an overall classification performance of 89%. Again this rose to 92% when only fighting vs normal behaviour was considered. Confusion matrices for classification of continuous video data on the BEHAVE dataset are given in table 3.

Table 3: Confusion matrices for the BEHAVE dataset continuous sequences at a window size of 90. Columns give the true class and rows the assigned class. (a) shows per class performance whilst (b) shows the results of aggregating fighting behaviour together.

(a)
                        True
Classified      Fight   Pre-Fight   Post-Fight   Normal
Fight           0.67    0.38        0.32         0.05
Pre-Fight       0.17    0.2         0            0.01
Post-Fight      0.01    0           0.68         0.01
Normal          0.15    0.42        0            0.93

(b)
                      True
Classified            Fighting Related   Normal
Fighting Related      0.81               0.07
Normal                0.19               0.93

An example of classiﬁcation is given below in ﬁg-

ure 5.

Figure 5: Predicted actions for the individual shown in the red box. The numbers in parentheses refer to the frame numbers. Here class 1 is fighting, 2 pre-fighting, 3 post-fighting and 4 a normal situation. Around frame 54,935 the individual slowly breaks away from the fighting. This may explain the errors around this time, as it looks very similar to a group splitting up. There is a slight prediction delay between fighting and the post-fight behaviour of running away. This is down to using a window around the current frame, thus basing the classification on some portion of the past, coupled with the uncertainty as events change.

When classification is performed in this manner parts of the sequences are misclassified as being normal when they are not. The lower number of examples of post-fight sequences when using a window size of 100 is due to the short timescale upon which they happen (i.e. there are not as many post-fight situations of 100 frames in length). A future improvement will be to construct the histograms so that their length adapts to the video information available. The switching between normal and fighting frames is due to the similarity in their appearance over a relatively short timescale.

CAVIAR Dataset. Results for the CAVIAR dataset are given in table 4. For this dataset the results are not as good. Fighting and pre-fighting are frequently confused with normal behaviour. This may be due to the very small number of fighting examples contained within this dataset coupled with their very short time span. When watching pre and post fighting behaviour, some of the examples have less purposeful movement and speed than those contained in the BEHAVE sequences and real fights.

Table 4: Confusion matrices for continuous sequences on the CAVIAR dataset at a window size of 45. Columns give the true class and rows the assigned class. (a) shows per class performance whilst (b) shows the results of aggregating fighting behaviour together.

(a)
                        True
Classified      Fight   Pre-Fight   Post-Fight   Normal
Fight           0.07    0           0            0.092
Pre-Fight       0       0           0.03         0.004
Post-Fight      0       0.11        0.32         0.004
Normal          0.92    0.88        0.64         0.9

(b)
                      True
Classified            Fighting Related   Normal
Fighting Related      0.24               0.1
Normal                0.76               0.9

6 CONCLUSIONS AND FUTURE WORK

The major contribution of this paper is an investigation of the feasibility of identifying pre-fight situations. The ability to identify when a fight is likely to break out is useful in surveillance applications, as it may be possible to intervene to stop a crime occurring, or at least to identify such situations at the earliest possible opportunity to allow useful intervention. Identifying post-fight behaviour is also of use, as there may be some areas which CCTV cameras do not cover. Cameras may only witness the end of a fight, but it may be important to send assistance to this area in an effort to help victims and stop further criminal acts occurring.

The second major contribution is in publishing results on publicly available datasets. Such transparency is important in order to establish how well algorithms work in comparison to others.

This paper has presented a way to classify fighting situations. Our method gives 96% correct classification on the BEHAVE dataset, compared to Datta et al. (Datta et al., 2002) who reported 97% and Cupillard et al. (Cupillard et al., 2002) who reported 95% for detection of fighting situations on other (and separate) datasets. However our method does not require the pre-segmentation of parts of individuals, foreground extraction or pre-compiled behaviour models. It has also been demonstrated that it is possible to identify pre and post-fight situations. Such cases are important in monitoring situations, as intervention before the act is always preferable.

A hierarchical classiﬁer is useful in many surveil-

lance applications. Using such a structure can visu-

ally show you how the classiﬁcation algorithm per-

ceives the features which are given to it. This can be

useful as a sanity check to make sure that the method

is grouping things as you expect them to be.

However it is felt the most useful aspect of using

a hierarchical classiﬁer is in the ability to subdivide

behaviours into a ﬁner degree of granularity. For ex-

ample in a surveillance application one may wish to

identify all the ﬁghting situations (as we have done

here) and then obtain further granularity so as to iden-

tify pre and post ﬁght situations as we have shown.

This ability is useful as it can allow a ﬁne tuning of a

surveillance system.

One issue raised here is that of overlapping classes. It has been shown that when all the fighting classes are combined the accuracy increases. This raises the question of whether the classes are truly different or rather just transitional states between normal and fighting behaviour. To investigate this an unsupervised method could be used. However it may still be useful to be able to distinguish the point before a fight (e.g. before someone gets hurt) in order to stop physical injury occurring.

Future work should seek to improve the classiﬁca-

tion of continuous sequences perhaps by incorporat-

ing temporal models (eg, hidden Markov models) to

improve classiﬁcation. A further extension would be

to remove the manual tracking component altogether

(although some targets will be temporarily lost), or to

combine individuals into group actions.

ACKNOWLEDGEMENTS

Thanks to Piotr Dollar for kindly making his cuboids code available. This work is funded by EPSRC's BEHAVE project GR/S98146.

REFERENCES

Blunsden, S., Andrade, E., Laghaee, A., and Fisher, R. (2007). BEHAVE interactions test case scenarios, EPSRC project GR/S98146, http://groups.inf.ed.ac.uk/vision/behavedata/interactions/index.html. On Line.

Cupillard, F., Bremond, F., and Thonnat, M. (2002). Group behavior recognition with multiple cameras. In Sixth IEEE Workshop on Applications of Computer Vision (WACV).

Datta, A., Shah, M., and Lobo, N. D. V. (2002). Person-on-person violence detection in video data. In Proceedings of the 16th International Conference on Pattern Recognition (ICPR'02) Volume 1, page 10433. IEEE Computer Society.

Davis, J. W. and Bobick, A. F. (2001). The representation and recognition of action using temporal templates. In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 23, pages 257–267. IEEE Computer Society.

Dee, H. and Hogg, D. C. (2004). Is it interesting? Comparing human and machine judgements on the PETS dataset. Sixth International Workshop on Performance Evaluation of Tracking And Surveillance, 33(1):49–55.

Dollar, P., Rabaud, V., Cottrell, G., and Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In PETS, pages 65–72, China.

Duda, R., Hart, P. E., and Stork, D. G. (2000). Pattern Classification, Second Edition. Wiley Interscience, University of Texas at Austin, Austin, USA.

Efros, A., Berg, A., Mori, G., and Malik, J. (2003). Recognising action at a distance. In 9th International Conference on Computer Vision, volume 2, pages 726–733.

Freund, Y. and Schapire, R. E. (1996). Game theory, on-line prediction and boosting. In Ninth Annual Conference on Computational Learning Theory, pages 325–332.

Niebles, J. C., Wang, H., and Fei-Fei, L. (2006). Unsupervised learning of human action categories using spatial-temporal words. In British Machine Vision Conference, Edinburgh.

Project/IST 2001 37540, E. F. C. (2004). Found at URL: http://homepages.inf.ed.ac.uk/rbf/caviar/.

Ribeiro, P. and Santos-Victor, J. (2005). Human activities recognition from video: modeling, feature selection and classification architecture. In Workshop on Human Activity Recognition and Modelling (HAREM 2005 - in conjunction with BMVC 2005), pages 61–70, Oxford.

Troscianko, T., Holmes, A., Stillman, J., Mirmehdi, M., and Wright, D. (2004). What happens next? The predictability of natural behaviour viewed through CCTV cameras. Perception, 33(1):87–101.

Zhu, J., Rosset, S., Zhou, H., and Hastie, T. (2006). Multi-class AdaBoost. Technical report, University of Michigan, Ann Arbor.
