movie or detecting specific occurrences rather than implementing a searching strategy. Many approaches aim to summarise movies by detecting scene boundaries and creating a scene-based index. For example, (Yeung and Yeo, 1996; Yeung and Yeo, 1997; Rui et al., 1998; Zhou and Tavanapong, 2002) all aim to cluster similar shots together and locate areas where one set of clusters does not relate to any previous clusters. Other approaches, such as (Li and Kuo, 2003; Rasheed and Shah, 2003; Sundaram and Chang, 2000), use the concept of shot coherence to locate scene boundaries. In general, this involves locating areas where fundamental changes in audiovisual features occur. However, a scene-based index does not carry any semantic meaning, and it is difficult to locate a sought part of a movie with scene boundary information alone, unless each individual scene is viewed.
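To make the shot-clustering idea concrete, the following minimal Python sketch flags a boundary wherever a shot’s colour histogram fails to match every shot in a short preceding window. It illustrates the general approach only; the histogram features, window size and threshold are assumptions chosen for the toy example, not parameters from the cited works.

```python
import numpy as np

def histogram_similarity(h1, h2):
    # Histogram intersection: 1.0 for identical normalised histograms.
    return np.minimum(h1, h2).sum()

def scene_boundaries(shot_histograms, window=5, threshold=0.6):
    """Flag a boundary before any shot whose colour histogram is
    dissimilar to every shot in the preceding window (a simplified
    shot-clustering heuristic)."""
    boundaries = []
    for i in range(1, len(shot_histograms)):
        recent = shot_histograms[max(0, i - window):i]
        if all(histogram_similarity(shot_histograms[i], h) < threshold
               for h in recent):
            boundaries.append(i)
    return boundaries

# Toy data: six shots from one 'scene', six from another, as 64-bin
# normalised colour histograms with a little per-shot noise.
rng = np.random.default_rng(0)
base_a, base_b = rng.dirichlet(np.ones(64)), rng.dirichlet(np.ones(64))

def noisy(base):
    h = np.clip(base + rng.normal(0, 0.001, 64), 0, None)
    return h / h.sum()

shots = [noisy(base_a) for _ in range(6)] + [noisy(base_b) for _ in range(6)]
print(scene_boundaries(shots))  # -> [6]
```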
Many other movie summarisation approaches focus on detecting individual event types from the video. (Lienhart et al., 1999) detect dialogues in video based on the common shot/reverse-shot shooting technique, which results in detectable repeating shots. This approach, however, is only applicable to dialogues involving two people: if three or more people are involved, the shooting structure becomes unpredictable. There are also many other event types in a movie or television program apart from dialogues.
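As a rough illustration of how repeating shots can be exploited (a sketch of the general idea, not Lienhart et al.’s actual method), suppose each shot has already been assigned a label by clustering shots on visual similarity; a two-person dialogue then appears as a run of labels alternating between two values:

```python
def dialogue_runs(labels, min_len=4):
    """Report maximal runs of at least min_len shots whose labels
    alternate between two values (A B A B ...), the shot/reverse-shot
    signature of a two-person dialogue.  Labels are assumed to come
    from clustering shots by visual similarity."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        extends = (i < len(labels)
                   and labels[i] != labels[i - 1]
                   and (i - start < 2 or labels[i] == labels[i - 2]))
        if not extends:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = i
    return runs

# Shots 0-4 alternate between labels 0 and 1: one dialogue candidate.
print(dialogue_runs([0, 1, 0, 1, 0, 2, 3, 2, 4]))  # -> [(0, 4)]
```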
(Li and Kuo, 2003; Li and Kuo, 2001) expand on this idea to detect three types of events: 2-person dialogues, multi-person dialogues and hybrid events (where a hybrid event is anything that is not a dialogue). However, only dialogues are treated as meaningful events; everything else is simply declared a hybrid event. (Chen et al., 2003) aim to detect both dialogue and action events in a movie; however, the same approach is used to detect both types of events, and the range of action events that can be detected is restricted. (Nam et al., 1998) detect violent events in a movie by searching for visual cues such as flames or blood pixels, or audio cues such as explosions or screaming. This approach, however, is quite restricted, as there may be violent events that do not contain their chosen features.
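The cue-based approach can be illustrated with a deliberately crude Python sketch. The ‘blood red’ colour bounds and both thresholds below are illustrative guesses, not values from (Nam et al., 1998):

```python
import numpy as np

def blood_pixel_ratio(frame_rgb):
    """Fraction of pixels falling in a crude 'blood red' colour range.
    The RGB bounds here are illustrative guesses, not calibrated values."""
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    mask = (r > 120) & (g < 60) & (b < 60)
    return mask.mean()

def looks_violent(frame_rgb, audio_rms, pixel_thresh=0.05, audio_thresh=0.3):
    """Flag a frame when either a visual cue (blood-coloured region)
    or an audio cue (loud sound) exceeds its threshold."""
    return blood_pixel_ratio(frame_rgb) > pixel_thresh or audio_rms > audio_thresh

# Toy frame: a 100x100 image with a 30x30 red patch (9% of pixels).
frame = np.zeros((100, 100, 3), dtype=np.uint8)
frame[:30, :30] = (200, 20, 20)
print(looks_violent(frame, audio_rms=0.1))  # True: the visual cue fires
```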
(Kang, 2003) extracts a set of colour and motion features from a video and then uses relevance feedback from individual users to classify events as ‘fear’, ‘sadness’ or ‘joy’. (Zhai et al., 2004) generate colour, motion and audio features for a video and then use finite state machines to classify scenes as conversation, suspense or action scenes. However, this approach relies on the presence of known scene breaks, and it classifies a whole scene into a single category, while in reality an entire scene may contain a number of important events.
Previous work by the authors (Lehane and O’Connor, 2006) created an event-based index of an entire movie by devising a set of event classes that cover all of the meaningful events in a movie. By detecting each of the events in an event class, the entire movie is indexed.
The objective of the searching method proposed in this paper is to retrieve sought events from a movie. An event is something which moves the story forward. Events are the portions of a movie which viewers remember as semantic units after the movie has finished. A conversation between a group of characters, for example, would be remembered as a semantic unit ahead of a single shot of a person talking in that conversation. Similarly, a car chase would be remembered as ‘a car chase’, not as 50 single shots of moving cars. A single shot of a car chase carries little meaning when viewed independently; it may not even be possible to deduce that a car chase is taking place from a single shot. When viewed in the context of the surrounding shots in the event, however, its meaning becomes apparent. Events are components of a single scene, and a scene may contain a number of different events. For example, a scene may contain a conversation followed by a fight, which are two distinct events. Similarly, there may be three different conversations (between three sets of people) in the same scene, corresponding to three different events. The searching system in this paper aims to return events to a user, rather than specific shots. This allows for efficient retrieval.
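This shot/event/scene hierarchy can be pictured with a minimal, purely illustrative data structure (the class names are hypothetical, not part of the system): retrieval returns events, each spanning a contiguous run of shots, rather than the shots themselves.

```python
from dataclasses import dataclass

@dataclass
class Event:
    """A semantic unit: a contiguous run of shots of one event type."""
    kind: str          # e.g. 'dialogue', 'car chase'
    first_shot: int    # index of the first shot in the event (inclusive)
    last_shot: int     # index of the last shot in the event (inclusive)

@dataclass
class Scene:
    """A scene may contain several distinct events."""
    events: list

# One scene holding a conversation followed by a fight, as in the
# example above: two events, one scene.
scene = Scene(events=[Event('dialogue', 0, 11), Event('fight', 12, 30)])
for e in scene.events:
    print(f"{e.kind}: shots {e.first_shot}-{e.last_shot}")
```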
Previous work by the authors focused on the detection of specific events in movies. (Lehane et al., 2005) proposed a method of detecting dialogue events, while (Lehane et al., 2004a) proposed an exciting-event detection method. The approach in (Lehane and O’Connor, 2006) built upon this work in order to generate a complete event-based summary of a movie. In each of the approaches above, a set of audiovisual features was generated, and finite state machines were utilised in order to detect the relevant events. Each of the approaches examined film grammar principles in order to assist in the event detection process.
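As a rough sketch of the general idea (not the authors’ actual state machines, whose states, features and thresholds are detailed in the cited papers), a two-state machine can scan per-shot feature flags and emit an event wherever enough flagged shots occur in close succession:

```python
def detect_events(shot_flags, min_shots=3, max_gap=2):
    """Minimal finite-state-machine sketch: scan per-shot boolean flags
    (e.g. 'speech present') and emit (start, end) shot ranges wherever
    enough flagged shots occur without too long a gap between them.
    States: IDLE (searching) and IN_EVENT (accumulating)."""
    events, state = [], 'IDLE'
    start = hits = gap = 0
    for i, flag in enumerate(shot_flags):
        if state == 'IDLE':
            if flag:
                state, start, hits, gap = 'IN_EVENT', i, 1, 0
        else:  # IN_EVENT
            if flag:
                hits, gap = hits + 1, 0
            else:
                gap += 1
                if gap > max_gap:  # too many unflagged shots: close the event
                    if hits >= min_shots:
                        events.append((start, i - gap))
                    state = 'IDLE'
    if state == 'IN_EVENT' and hits >= min_shots:
        events.append((start, len(shot_flags) - 1 - gap))
    return events

# Speech detected in shots 0-4 and 9-10 (the latter too short to count).
flags = [1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
print(detect_events(flags))  # -> [(0, 4)]
```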
The work presented in this paper extends the event detection structure in order to allow user-specified searching. This allows a human browser to retrieve events based on their own requirements, rather than on a predefined structure. As this paper presents an extension of previous work, the explanation of some of the system design elements, in particular the feature generation process, is abbreviated.
The presented approach utilises audiovisual searching only. No textual information is used, and the results are based solely on the extracted features. Section 2 describes the feature generation part of the search system, which includes the feature selection process and the creation of a shot-level feature vector.
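For illustration, a shot-level feature vector might be organised as below. The particular fields are assumptions chosen for this sketch; the features actually used are those described in the authors’ earlier work:

```python
from dataclasses import dataclass

@dataclass
class ShotFeatures:
    """Hypothetical shot-level feature vector: each field summarises
    one audiovisual property over the duration of a single shot."""
    motion_intensity: float   # mean magnitude of frame-to-frame motion
    shot_length: float        # duration in seconds
    audio_energy: float       # mean RMS energy of the audio track
    speech_ratio: float       # fraction of audio frames classed as speech
    music_ratio: float        # fraction of audio frames classed as music

    def as_vector(self):
        return [self.motion_intensity, self.shot_length,
                self.audio_energy, self.speech_ratio, self.music_ratio]

shot = ShotFeatures(0.12, 3.5, 0.4, 0.8, 0.1)
print(shot.as_vector())
```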
Section 3 describes the search method itself. This is a two-step process: firstly, finite state machines (FSMs) are utilised in order to generate a set of sequences, and secondly, a layer of filtering is undertaken so that a set of events matching the user’s query is returned.
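The overall flow of this two-step process might be sketched as follows. This is an assumed shape for the pipeline rather than the system itself, with the FSM step replaced by a trivial stand-in that emits maximal runs of flagged shots:

```python
def search(shots, fsm_detect, user_filter):
    """Two-step search sketch: an FSM pass proposes candidate event
    sequences, then a filtering pass keeps only those matching the
    user's criteria."""
    candidates = fsm_detect(shots)                              # step 1
    return [ev for ev in candidates if user_filter(shots, ev)]  # step 2

def runs_of_flags(flags):
    """Stand-in FSM step: emit maximal runs of flagged shots."""
    events, start = [], None
    for i, f in enumerate(flags + [0]):
        if f and start is None:
            start = i
        elif not f and start is not None:
            events.append((start, i - 1))
            start = None
    return events

flags = [1, 1, 1, 1, 1, 0, 0, 1, 1, 0]
# User criterion: keep only events spanning at least four shots.
print(search(flags, runs_of_flags,
             lambda shots, ev: ev[1] - ev[0] + 1 >= 4))  # -> [(0, 4)]
```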