Complexity Analysis of Video Frames by Corresponding Audio Features
SeungHo Shin and TaeYong Kim
GSAIM, Chung-Ang University, Seoul, Korea
Keywords: Video Complexity Analysis, Video Indexing, Audio Indexing, Rate-control.
Abstract: In this paper, we propose a method to estimate video complexity from audio features based on
human synesthesia factors. By analyzing the features of the audio segments associated with video frames, we
obtain an initial estimate of the complexity of the video frames, which can improve the performance of video
compression. The effectiveness of the proposed method is verified by applying it to an actual H.264/AVC Rate-Control.
1 INTRODUCTION
Detecting video complexity with high accuracy is
essential for improving compression performance,
since it allows unnecessary processing to be reduced
in video coding. In this respect, video complexity
analysis based on audio features has clear advantages
over analysis using visual information alone: it
reduces computation and simplifies the algorithm.
For instance, combat scenes can be found easily by
detecting frames that contain gunfire or explosion
sounds, whereas the same scenes are hard to detect
with visual analysis alone, such as color differences,
inter-frame histograms, object detection, motion
estimation, and other visual features. Therefore, in
the analysis and understanding of video content,
using auditory information as an auxiliary element
alongside the visual information improves the
accuracy of complexity analysis.
In this paper, we propose a novel method named
“Content-based Video Complexity Analysis
(CVCA)” that estimates temporal and spatial
complexity based on human synesthesia factors by
analyzing the correlations between the video and
audio presented in moving pictures. The most
important characteristic of CVCA is its use of
variations in the audio signal to analyze the
complexity of the video. The effectiveness of this
method is verified by applying it to an actual
H.264/AVC Rate-Control.
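The core idea of using audio-signal variation as a complexity cue can be sketched roughly as follows. This is an illustrative approximation, not the paper's actual CVCA algorithm; the function names, sampling rate, and frame rate are assumptions.

```python
import numpy as np

def frame_energy(audio, sr, fps):
    """Split the audio track into segments aligned to video frames
    and return the RMS energy of each frame-length segment."""
    samples_per_frame = int(sr / fps)
    n_frames = len(audio) // samples_per_frame
    return np.array([
        np.sqrt(np.mean(audio[i * samples_per_frame:(i + 1) * samples_per_frame] ** 2))
        for i in range(n_frames)
    ])

def complexity_score(audio, sr=48000, fps=30):
    """Hypothetical complexity proxy: average frame-to-frame energy
    change, normalized by the mean energy. Larger values suggest a
    more dynamic (and thus temporally more complex) scene."""
    e = frame_energy(audio, sr, fps)
    if len(e) < 2:
        return 0.0
    return float(np.abs(np.diff(e)).mean() / (e.mean() + 1e-8))
```

A steady tone yields a score near zero, while a signal whose loudness jumps from frame to frame scores much higher, matching the intuition that agitated audio accompanies agitated video.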
2 CORRELATION BETWEEN
VIDEO AND AUDIO FEATURES
Visual media that has a storyline is called
“synthetic art” and communicates with our eyes and
ears through the video’s visual information and the
audio’s auditory information (Li, 2004). The audio
elements related to the video can be classified
into three factors: dialog between actors, sound
effects from the surrounding objects, and
background sounds for scene enhancement (Lu,
2002).
1) In scenes where actors converse normally, the
voice tone remains steady and the audio signals are
regular and natural. However, when the actors argue
over something, the voice tone becomes rough and
the audio signals become irregular (Pinquier, 2002).
To express such confrontational scenes effectively,
the camera switches quickly between actors, and the
actors’ motion also increases, with exaggerated
gestures expressing agitated emotions.
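The regular-versus-irregular distinction above could be quantified, for example, by how much a short-time feature such as the zero-crossing rate fluctuates across segments. This is only one plausible measure, not the feature set used in the paper, and the names and segment length below are assumptions.

```python
import numpy as np

def zcr(segment):
    """Zero-crossing rate of one audio segment (fraction of
    consecutive sample pairs whose signs differ)."""
    signs = np.sign(segment)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return np.mean(signs[:-1] != signs[1:])

def zcr_irregularity(audio, seg_len=1600):
    """Hypothetical irregularity measure: standard deviation of the
    per-segment zero-crossing rates. Steady speech or tones give a
    low value; rough, agitated signals give a high one."""
    n = len(audio) // seg_len
    rates = [zcr(audio[i * seg_len:(i + 1) * seg_len]) for i in range(n)]
    return float(np.std(rates))
```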
2) The environment and place represented in the
scenes change variously according to the storyline,
and the sound effects depend on the given
environment and place. Viewers can roughly
recognize situations and scenes from the sound
effects alone. The most representative case is a
combat scene with gunfire and explosion sounds.
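Impulsive sound effects such as gunfire or explosions stand out as short energy bursts, so a crude detector only needs to flag segments whose energy far exceeds the typical level. The following is a minimal sketch under that assumption; the threshold ratio and names are hypothetical, not taken from the paper.

```python
import numpy as np

def detect_bursts(audio, seg_len=1600, ratio=4.0):
    """Flag segments whose mean-square energy exceeds `ratio` times
    the median segment energy -- a crude stand-in for detecting
    gunfire- or explosion-like transients."""
    n = len(audio) // seg_len
    energy = np.array([
        np.mean(audio[i * seg_len:(i + 1) * seg_len] ** 2) for i in range(n)
    ])
    threshold = ratio * (np.median(energy) + 1e-12)
    return np.flatnonzero(energy > threshold)
```

Frames aligned with the flagged segments could then be treated as candidates for high spatio-temporal complexity.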
3) Background sounds are also used to express
video scenes more effectively. The tempo and
rhythm of the background sounds are among the
most carefully considered elements in video editing.
Shin, S. and Kim, T. Complexity Analysis of Video Frames by Corresponding Audio Features. DOI: 10.5220/0003982001110114. In Proceedings of the International Conference on Signal Processing and Multimedia Applications and Wireless Information Networks and Systems (SIGMAP-2012), pages 111-114. ISBN: 978-989-8565-25-9. Copyright © 2012 SCITEPRESS (Science and Technology Publications, Lda.)