we propose improvements to this methodology and to its application in video summarization. Specifically, the methodology proposed in (Ramos et al., 2018) uses the following pipeline: (a) build the labeled training set $X_s$ in the original space; (b) perform a coarse temporal flow segmentation using a simple similarity measure combined with an interval tree analysis; (c) select key frames inside each obtained segment and assemble these frames into a training set $X_t$; (d) train a semi-supervised AE architecture using the sets $X_s$ and $X_t$ to generate a distance metric (encoding function); (e) apply the X-means clustering technique (Pelleg and Moore, 2000), with this distance metric, to compute the final partition of the frame sequence; (f) for each obtained cluster, take a key frame to build the summarization sequence.
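To make the data flow explicit, the following Python sketch outlines steps (b)-(f); every helper it calls (coarse_segmentation, select_key_frames, train_semisupervised_ae, xmeans_cluster) is a hypothetical placeholder for the corresponding stage and not the actual implementation of (Ramos et al., 2018).

```python
import numpy as np

def summarize_baseline(frames, X_s, coarse_segmentation, select_key_frames,
                       train_semisupervised_ae, xmeans_cluster):
    """Schematic pipeline of (Ramos et al., 2018); all callables are placeholders."""
    # (b) coarse temporal segmentation (similarity measure + interval tree analysis)
    segments = coarse_segmentation(frames)
    # (c) key frames selected inside each segment assemble the training set X_t
    X_t = np.vstack([select_key_frames(frames[seg]) for seg in segments])
    # (d) semi-supervised AE trained on X_s and X_t yields the encoding function
    encode = train_semisupervised_ae(X_s, X_t)
    # (e) X-means clustering in the encoded space (number of groups found automatically)
    labels = xmeans_cluster(encode(frames))
    # (f) one key frame per cluster forms the summary
    return [frames[labels == c][0] for c in np.unique(labels)]
```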
However, we have noticed that the interval tree and a simple distance metric (such as the Frobenius distance) do not provide a satisfactory temporal segmentation of the video frames. Furthermore, the X-means algorithm tries to perform clustering without the need to set a pre-defined number $K$ of groups; however, in the case of video sequences, the obtained results are not satisfactory. Besides, the advantages of a high-cost semi-supervised AE over a simpler unsupervised one are not clear. Therefore, in our work we propose to use total variation denoising, followed by differentiation and thresholding, instead of the interval tree. We also replace the distance metric with a similarity measure based on the data correlation. Moreover, we perform clustering using K-means instead of X-means. We also compare a simple AE with the semi-supervised AE proposed in (Ramos et al., 2018) to evaluate the encoding quality of the former against the latter. To the best of our knowledge, this is the first work that applies a semi-supervised technique for video summarization in a systematic procedure. The whole methodology is the main contribution of this paper.
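As an illustration of these replacements (a sketch only, with placeholder parameter values, not our exact configuration), the snippet below builds a correlation-based similarity signal between consecutive encoded frames, smooths it with an off-the-shelf total variation denoiser, differentiates and thresholds it to obtain coarse segment boundaries, and finally partitions the encoded frames with K-means:

```python
import numpy as np
from skimage.restoration import denoise_tv_chambolle  # TV denoising (Chambolle's solver)
from sklearn.cluster import KMeans

def segment_and_cluster(encoded, tv_weight=0.1, boundary_thresh=0.05, n_clusters=10):
    """encoded: (n_frames, n_features) array of encoded frames. Illustrative values only."""
    # Correlation-based similarity between consecutive frames (1-D signal).
    sim = np.array([np.corrcoef(encoded[i], encoded[i + 1])[0, 1]
                    for i in range(len(encoded) - 1)])
    # Total variation denoising of the similarity signal, then differentiation.
    smooth = denoise_tv_chambolle(sim, weight=tv_weight)
    jumps = np.abs(np.diff(smooth))
    # Thresholding the derivative gives the coarse segment boundaries.
    boundaries = np.flatnonzero(jumps > boundary_thresh) + 1
    # K-means (fixed K) replaces X-means for the final partition.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(encoded)
    return boundaries, labels
```

Here the fixed $K$ reflects the observation above that the automatic choice of the number of groups made by X-means is not satisfactory for video sequences.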
The remaining text is organized as follows. In Section 2 we survey related works. In Section 3 we describe the background techniques. The proposed methodology is presented in Section 4. Next, Section 5 discusses the computational experiments. Conclusions and possible future works are presented in Section 6.
2 RELATED WORKS
A variety of video summarization techniques have been proposed in the literature (Yuan et al., 2019; Zhao et al., 2019; Li et al., 2019; de Avila et al., 2011). In this paper we focus only on techniques based on image features to construct the summary. Generally, these techniques can be classified into unsupervised and supervised ones.
Clustering algorithms are among the most popular unsupervised methodologies. Given hand-crafted features, similar frames are grouped into clusters and, from each cluster, the centers are taken to build the summary. Also, some works in the unsupervised class use frame histograms to learn clusters (de Avila et al., 2011). In (Mohan and Nair, 2018) there is an extension of this approach: the shot histograms are compressed by a high-level feature extractor before the clustering step. Other works construct models, as in (Lei et al., 2019), where the video is modeled as a graph in which each vertex corresponds to a frame and the edge between two vertices is weighted by the Kullback-Leibler divergence between the semantic probability distributions of the two frames. Clustering and ranking operations are applied to this graph to create the summary. Recent works in unsupervised summarization have applied deep learning (Yuan et al., 2019; Zhao et al., 2019; Jung et al., 2018), using, in general, convolutional or recurrent neural networks.
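For concreteness, a minimal sketch in the spirit of this histogram-plus-clustering family (illustrative only, not the exact algorithm of (de Avila et al., 2011)) could look as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

def histogram_keyframes(frames, n_keyframes=5, bins=16):
    """frames: sequence of HxWx3 uint8 images. Illustrative parameter values."""
    # One normalized color histogram per frame (hand-crafted feature).
    hists = np.array([np.concatenate([np.histogram(f[..., c], bins=bins,
                                                   range=(0, 256), density=True)[0]
                                      for c in range(3)])
                      for f in frames])
    km = KMeans(n_clusters=n_keyframes, n_init=10).fit(hists)
    # Keyframe = frame whose histogram is closest to its cluster center.
    keyframe_idx = [int(np.argmin(np.linalg.norm(hists - c, axis=1)))
                    for c in km.cluster_centers_]
    return sorted(keyframe_idx)
```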
Deep learning is the principal tool in supervised
summarization approaches. In this case, summarization is modeled as a classification problem, where frames in the ground truth have labels that represent classes in the video (Jung et al., 2018). However, even with better results in video summarization, labeling many video frames is a tedious task that depends on human intervention. Moreover, overfitting problems can frequently occur if insufficient labeled data is available. Therefore, a semi-supervised approach, which uses just one labeled video to summarize a set of other videos, can mitigate these limitations and contribute to the area, which motivates our work.
3 TECHNICAL BACKGROUND
In mathematical terms, total variation denoising computes a smooth version of an input signal $y \in \mathbb{R}^N$ by solving the optimization problem (Selesnick, 2012),
$$\hat{x} = \arg\min_{x \in \mathbb{R}^N} \frac{1}{2}\|y - x\|_2^2 + \lambda \|Dx\|_1, \qquad (1)$$
where $x \in \mathbb{R}^N$, $\lambda$ is the smoothing factor, and $D$ is an $(N-1) \times N$ matrix described in (Selesnick, 2012).
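A minimal numpy sketch of one way to solve (1) is the majorization-minimization iteration of (Selesnick, 2012); the dense matrices and the fixed iteration count below are simplifications for illustration:

```python
import numpy as np

def tv_denoise_mm(y, lam, n_iter=50):
    """1-D total variation denoising, Eq. (1), via majorization-minimization."""
    y = np.asarray(y, dtype=float)
    N = y.size
    D = np.diff(np.eye(N), axis=0)     # (N-1) x N first-difference matrix
    DDT = D @ D.T                      # tridiagonal and positive definite
    Dy = D @ y
    x = y.copy()
    for _ in range(n_iter):
        # Majorize the l1 penalty around the current iterate and solve the
        # resulting linear system: x <- y - D^T ((1/lam) diag(|Dx|) + D D^T)^{-1} D y
        Lam = np.diag(np.abs(D @ x)) / lam
        x = y - D.T @ np.linalg.solve(Lam + DDT, Dy)
    return x
```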
An AE can be viewed as a special case of a feedforward neural network that is trained to reproduce
its input at the output layer (Goodfellow et al., 2016).
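For reference, a minimal unsupervised AE of this kind can be written in a few lines of Keras; the layer sizes and activations below are placeholders and not the configuration used in our experiments:

```python
from tensorflow.keras import layers, models

input_dim, code_dim = 4096, 128   # placeholder dimensions

inputs = layers.Input(shape=(input_dim,))
code = layers.Dense(code_dim, activation="relu")(inputs)        # encoder
outputs = layers.Dense(input_dim, activation="sigmoid")(code)   # decoder

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# Training reproduces the input at the output layer:
# autoencoder.fit(X, X, epochs=50, batch_size=64)
```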
The AE architecture is depicted in Figure 1. Both