ing contextual dynamics and learning a scene model.
In (Chasanis et al., 2009), shots are represented by
means of key-frames, clustered with spectral cluster-
ing, and then labeled according to the clusters they
belong to. Scene boundaries are then detected from
the alignment score of the symbolic sequences.
In graph-based methods, instead, shots are ar-
ranged in a graph representation and then clustered
by partitioning the graph. The Shot Transition Graph
(STG), proposed in (Yeung et al., 1995), is one of the
most used models in this category: here each node
represents a shot and the edges between the shots
are weighted by shot similarity. In (Rasheed and
Shah, 2005), color and motion features are used to
represent shot similarity, and the STG is then split
into subgraphs by applying normalized cuts for
graph partitioning. More recently, Sidiropoulos et
al. (Sidiropoulos et al., 2011) introduced a new STG
approximation that exploits features extracted from
the visual and the auditory channel.
3 VIDEO ANALYSIS
Videos can be decomposed at three different granu-
larity levels: frames, shots and scenes. A video, in-
deed, is an ordered set of frames; sequences of adja-
cent frames taken by a single camera compose a shot,
and two consecutive shots can be separated by a transition, which is in turn a set of frames. Finally, sets
of contiguous and semantically coherent shots form a
scene.
Since scenes are sets of shots, the first step in scene detection is the identification of shot boundaries. We propose a shot segmentation approach that achieves high accuracy while keeping execution times low. Our method identifies shot boundaries by means of an extended difference measure that quantifies the change in content between two different positions in the video. Shots are then grouped into scenes with a clustering approach that includes temporal cues. We also describe a solution to sort the key-frames of a scene, letting the user select the level of detail of a scene summary. Finally, every shot is enriched with a number of tags, automatically detected on the selected key-frames, using the API provided by Clarifai, Inc. (https://developer.clarifai.com/docs/).
3.1 Shot Boundary Detection
Given two consecutive shots in a video sequence, the
first one ending at frame e, and the second one start-
ing at frame s, we define the transition length as the
number of frames in which the transition is visible, L = s − e − 1. An abrupt transition, therefore, is a transition with length L = 0. The transition center, n = (e + s)/2, may correspond to a non-integer value, that is, an inter-frame position; this is always the case for abrupt transitions.
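For instance, assuming a hard cut in which the first shot ends at frame e = 120 and the second one starts at frame s = 121 (hypothetical values), the transition length is L = 121 − 120 − 1 = 0 and the transition center n = (120 + 121)/2 = 120.5 falls at an inter-frame position.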
Having selected a feature F to describe frames in
a video, we define the extended difference measure $M_n^w$, centered on frame or half-frame $n$, with $2n \in \mathbb{N}$, and with a frame-step $2w \in \mathbb{N}$, as

$$
M_n^w =
\begin{cases}
d\bigl(F(n-w),\, F(n+w)\bigr) & \text{if } n+w \in \mathbb{N}\\[4pt]
\frac{1}{2}\bigl(M_{n-\frac{1}{2}}^w + M_{n+\frac{1}{2}}^w\bigr) & \text{otherwise}
\end{cases}
\qquad (1)
$$
where $d(F(i), F(j))$ is the distance between frames i and j, computed in terms of feature F. The second case of the expression is a linear interpolation adopted for inter-frame positions. This is necessary because feature F is relative to a single frame and cannot be directly computed at half-frame positions. The selected feature should be almost constant immediately before and after a transition, and should have a constant derivative during a linear transition.
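As an illustration, the following is a minimal Python sketch of Eq. (1). The concrete feature F and distance d are not fixed at this point of the discussion, so the sketch uses an L1 distance between hypothetical per-frame descriptors as a placeholder; the function and variable names are ours.

```python
import numpy as np

def frame_distance(f_i, f_j):
    # d(F(i), F(j)): distance between two per-frame descriptors.
    # Placeholder: an L1 distance between (hypothetical) per-frame
    # histograms, since the actual feature is not specified here.
    return np.abs(f_i - f_j).sum()

def extended_difference(features, n, w):
    """M_n^w from Eq. (1).

    `features[i]` is the descriptor F(i) of frame i; `n` and `w`
    may be integers or half-integers (2n and 2w are integers).
    """
    if float(n + w).is_integer():
        return frame_distance(features[int(n - w)], features[int(n + w)])
    # Inter-frame position: linear interpolation of the two neighbours.
    return 0.5 * (extended_difference(features, n - 0.5, w)
                  + extended_difference(features, n + 0.5, w))
```

With w = 0.5, this yields the frame-to-frame differences that are thresholded in the first step of the algorithm described next.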
The algorithm starts by thresholding the $M_n^w$ values at all frame and half-frame positions, with w = 0.5. This gives a set of candidate positions for transitions.
tions. Two operations are then needed: merging and
validation. Merging is the aggregation of adjacent
candidate positions, which provides a list of candi-
date transitions $C = \{t_i = (f_i, l_i)\}$, where $f_i$ is the first position of the transition, and $l_i$ is the last position.
These may be real transitions (most likely hard cuts) or false positives, i.e. shots with high difference levels due to motion. A validation step is then performed
to prune false positives, by measuring the transition
Peak value, which is defined as:
$$
\mathrm{Peak}_w(t) = \max_{f \le n \le l}\left(M_n^w\right) - \min\left(M_{f-2w}^w,\, M_{l+2w}^w\right) \qquad (2)
$$
$\mathrm{Peak}_w(t)$ measures the variation in difference values
between the transition and the adjacent shots. In or-
der to validate the transition, therefore, a significant
variation must be observed on at least one side of the
candidate transition.
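To make the merging and validation steps concrete, here is a rough Python sketch under stated assumptions: `diffs` is a dictionary mapping each analyzed (half-)frame position n to its $M_n^w$ value, `threshold` plays the role of the threshold on difference values, and `peak_threshold` is a hypothetical name for the minimum $\mathrm{Peak}_w(t)$ required to accept a candidate; the function names are ours.

```python
def find_candidate_transitions(diffs, threshold, step=0.5):
    """Merge adjacent above-threshold positions into candidates (f_i, l_i).

    `diffs` maps each analyzed (half-)frame position n to M_n^w;
    positions are assumed to be spaced by `step`.
    """
    candidates, current = [], []
    for n in sorted(diffs):
        if diffs[n] > threshold:
            if current and n - current[-1] > step:
                candidates.append((current[0], current[-1]))
                current = []
            current.append(n)
        elif current:
            candidates.append((current[0], current[-1]))
            current = []
    if current:
        candidates.append((current[0], current[-1]))
    return candidates

def validate(candidate, diffs, w, peak_threshold):
    """Accept a candidate t = (f, l) only if Peak_w(t) from Eq. (2) is
    large enough, i.e. the difference values rise significantly with
    respect to at least one of the two adjacent shots."""
    f, l = candidate
    inside = max(diffs[n] for n in diffs if f <= n <= l)
    outside = min(diffs.get(f - 2 * w, float('inf')),
                  diffs.get(l + 2 * w, float('inf')))
    return inside - outside > peak_threshold
```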
To detect gradual transitions, the previous steps are repeated at increasing values of w. This could cause other positions to surpass the threshold value, thus altering and possibly invalidating previously found transitions. For this reason, every validated transition is protected by a safe zone: only positions whose distance from previously found transitions exceeds a certain number of frames are further analyzed.
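The multi-scale loop over increasing values of w, with safe zones around validated transitions, could then be organized roughly as in the sketch below. It reuses the helpers sketched above; the list of w values and the `safe_zone` radius (in frames) are assumptions, not values prescribed by the method.

```python
def detect_shot_boundaries(features, w_values, T, peak_threshold, safe_zone):
    """Repeat threshold / merge / validate at increasing w (sketch).

    Already validated transitions are protected by a safe zone:
    positions closer than `safe_zone` frames to one of them are
    not analyzed again at larger values of w.
    """
    transitions = []                      # validated (f, l) pairs
    last = len(features) - 1
    for w in w_values:                    # e.g. (0.5, 1.0, 1.5, ...)
        diffs, n = {}, w
        while n + w <= last:
            if all(n < f - safe_zone or n > l + safe_zone
                   for f, l in transitions):
                diffs[n] = extended_difference(features, n, w)
            n += 0.5
        for cand in find_candidate_transitions(diffs, T):
            if validate(cand, diffs, w, peak_threshold):
                transitions.append(cand)
    return sorted(transitions)
```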
In total four parameters need to be set up for our algorithm: $T$, the threshold on difference levels; $T_P$,