Figure 5: Text binarization results of our dynamic contrast-brightness adaptation method.
metric. In this way, the high-frequency noise in the video frames can be removed from the comparison by enlarging the size of valid CCs. The segmentation results from the first step are too redundant for indexing, since they may contain the progressive build-up of a complete final slide over a sequence of partial versions (cf. Fig. 6). Therefore, the segmentation process continues with a second step: we first perform a statistical analysis on a large number of slide videos, intending to find the slide content distribution commonly used in lecture videos. Then, in the title region of the slide, we apply the same differencing metric as in the first step, since any change in the title region may indicate a slide transition. In the content region, we detect the regions of the first and the last text lines and check the same regions in two adjacent frames; when the same text lines cannot be found in both adjacent frames, a slide transition is detected. More details about the algorithm can be found in (Yang et al., 2011).
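A minimal sketch of the title-region check is given below, assuming grayscale frames as NumPy arrays; the title fraction, noise tolerance, and change threshold are illustrative assumptions rather than values derived from our statistical analysis.

import numpy as np

def title_region_changed(prev_gray, cur_gray, title_frac=0.2, thresh=0.01):
    # Compare the title regions (top part) of two grayscale frames.
    # A changed-pixel ratio above `thresh` suggests a slide transition.
    # Both parameter values are illustrative assumptions.
    top = int(prev_gray.shape[0] * title_frac)
    diff = np.abs(prev_gray[:top].astype(np.int16)
                  - cur_gray[:top].astype(np.int16))
    changed = (diff > 32).mean()  # 32: assumed per-pixel noise tolerance
    return changed > thresh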
Since the proposed method in (Yang et al., 2011) is designed for slide frames, it might not be suitable when videos of varying genres are embedded in the slides and/or played during the presentation. To address this issue, we have developed an SVM classifier to distinguish slide frames from other video frames. In order to train an efficient classifier, we evaluate two features: the HOG (Histogram of Oriented Gradients) feature and the image intensity histogram feature. The detailed evaluation of both features is discussed in Section 6.
The histogram of oriented gradients feature has been widely used in object detection (N. Dalal, 2005) and optical character recognition, due to its efficiency in describing the local appearance and shape variation of image objects. To calculate the HOG feature, the gradient vector of each image pixel within a predefined local region is computed using the Sobel operator (Sobel, 1990). All gradient vectors are then decomposed into n directions. A histogram is subsequently created by accumulating the gradient vectors, with each histogram bin corresponding to one gradient direction. To reduce the sensitivity of the HOG feature to illumination, the feature values are normalized.
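A minimal sketch of this computation is shown below, assuming OpenCV and NumPy; it builds a single global histogram over the frame rather than the per-cell, block-normalized descriptor of (N. Dalal, 2005).

import cv2
import numpy as np

def hog_feature(gray, n_bins=9):
    # Gradients via the Sobel operator, as described above.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    # Unsigned orientation in [0, 180); each bin covers one direction.
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((angle / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins, dtype=np.float32)
    for b in range(n_bins):
        hist[b] = magnitude[bins == b].sum()  # accumulate gradient magnitudes
    # L2 normalization reduces sensitivity to illumination changes.
    return hist / (np.linalg.norm(hist) + 1e-6)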
In slide frames, the major content, such as text and tables, consists of many horizontal and vertical edges, so the gradient distribution should be distinguishable from that of other video genres.
We have also adopted an image intensity histogram feature to train the SVM classifier. Since the complexity and the intensity distribution of slide frames differ strongly from those of other video genres, we decided to use this simple and efficient feature as a comparison to the HOG feature.
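The sketch below shows how such a histogram feature could be computed and fed to an SVM, assuming scikit-learn; the number of bins and the RBF kernel are illustrative choices, not those used in our experiments.

import cv2
import numpy as np
from sklearn import svm

def intensity_histogram(gray, n_bins=32):
    # Normalized intensity histogram of a grayscale frame;
    # 32 bins is an illustrative choice.
    hist = cv2.calcHist([gray], [0], None, [n_bins], [0, 256]).ravel()
    return hist / (hist.sum() + 1e-6)

def train_classifier(frames, labels):
    # `frames`: grayscale frames; `labels`: 1 = slide frame, 0 = other.
    features = np.array([intensity_histogram(f) for f in frames])
    clf = svm.SVC(kernel='rbf')  # kernel choice is an assumption
    clf.fit(features, labels)
    return clf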
3.2 Video OCR
Text in video provides a valuable source for indexing. Regarding lecture videos, the text displayed on slides is strongly related to the lecture content. We have therefore developed an approach for video text recognition using video OCR technology. The commonly used video OCR framework consists of three steps.
Text detection is the first step of video OCR: this process determines whether a single frame of a video file contains text lines, for which tight bounding boxes are returned (cf. Fig. 4). Text extraction: in this step, the text pixels are separated from their background, normally by using a binarization algorithm. Fig. 5 shows the text binarization results of our dynamic contrast-brightness adaptation method (Yang et al., 2011). The adapted text line images are converted to a format accepted by a standard print OCR engine. Text recognition: we apply a multi-hypothesis framework to recognize text from the extracted text line images. A subsequent spell-checking process filters incorrect words out of the recognition results.
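As a simple illustration of the spell-checking filter, the following sketch keeps only words found in a dictionary; the dictionary source and any edit-distance matching for near-misses are left open.

def filter_ocr_words(ocr_words, dictionary):
    # Keep only recognized words that pass a dictionary lookup;
    # `dictionary` is any set of known lowercase words. A real system
    # might additionally accept near-matches via edit distance.
    return [w for w in ocr_words if w.lower() in dictionary]

# e.g. filter_ocr_words(["lecture", "v1deo"], {"lecture", "video"})
# returns ["lecture"]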
The video OCR process is applied to the key frames produced by the segmentation stage. The occurrence duration of each detected text line is determined by reading the time information of the corresponding segment. Our text detection method consists of two stages: a fast edge-based text detector for coarse detection, and an SWT (Stroke Width Transform) based verification procedure (B. Epshtein, 2010) that removes false alarms from the detection stage. The output of our text detection method is an XML-encoded list serving as input for the text recognition and the outline extraction.
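A minimal sketch of the coarse detection stage might look as follows, assuming OpenCV; the Canny edge parameters, the dilation kernel, and the size constraints are illustrative assumptions, and the SWT-based verification is omitted.

import cv2

def detect_text_boxes(gray, min_w=40, min_h=8, max_h=60):
    # Edge map of the frame; horizontal dilation merges character
    # edges into line-shaped blobs.
    edges = cv2.Canny(gray, 100, 200)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
    merged = cv2.dilate(edges, kernel)
    contours, _ = cv2.findContours(merged, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        # Keep wide, line-shaped candidates; limits are assumed values.
        if w >= min_w and min_h <= h <= max_h and w > 2 * h:
            boxes.append((x, y, w, h))
    return boxes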
For the text recognition, we developed a multi-hypothesis framework. Three segmentation methods are applied for the text extraction: the Otsu thresholding algorithm (Otsu, 1979), Gaussian-based adaptive thresholding (R. C. Gonzalez, 2002), and our dynamic image contrast-brightness adaptation method. Then we apply the OCR engine to each segmentation result.
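The first two hypotheses can be sketched with OpenCV as follows; the block size and offset of the adaptive thresholding are assumed values, and our dynamic contrast-brightness adaptation step is not reproduced here.

import cv2

def binarization_hypotheses(gray):
    # Hypothesis 1: global Otsu thresholding.
    _, otsu = cv2.threshold(gray, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Hypothesis 2: Gaussian-based adaptive thresholding;
    # block size 31 and offset 10 are illustrative assumptions.
    adaptive = cv2.adaptiveThreshold(gray, 255,
                                     cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 31, 10)
    return [otsu, adaptive]

Each hypothesis is then passed to the OCR engine, and the resulting candidate words are merged by the spell-checking step described above.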