• An error made at an upper processing stage
propagates into the total error of the search
procedure. For example, if a speech fragment is
falsely recognized as music, the keyword search
will never even start.
• In a number of cases it is very hard to
distinguish speech from music at the
preliminary stage because of their similarity
in most parameters (for example, Chinese
speech);
• In a real system we need to pass through all
stages of analysis to find the required audio
fragment regardless of its context. In the
case of speech indexing, different words have
different features with different discriminative
power, and this a priori information has to be
used for fast search. For example, it is very
hard to find very short words like “yes” or “no”,
but it is not a problem to find the specific long
word “synchrophasotron”.
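The use of a priori word distinctiveness can be illustrated by a simple heuristic that orders query keywords so that the most distinctive ones are searched first. The scoring function below is a hypothetical sketch for illustration, not part of the described method.

```python
def distinctiveness(word, corpus_freq):
    """Crude a priori score: long, rare words are easier to spot
    in an audio stream than short, frequent ones."""
    # corpus_freq maps a word to its relative frequency in (0, 1];
    # words absent from the table get a small floor value.
    freq = corpus_freq.get(word, 1e-6)
    return len(word) / freq

def order_keywords(keywords, corpus_freq):
    """Return keywords sorted so the most distinctive is searched first."""
    return sorted(keywords,
                  key=lambda w: distinctiveness(w, corpus_freq),
                  reverse=True)

# Illustrative frequencies only
freqs = {"yes": 0.02, "no": 0.03, "synchrophasotron": 1e-7}
print(order_keywords(["no", "synchrophasotron", "yes"], freqs))
# "synchrophasotron" comes first: it is long and extremely rare
```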
To overcome these drawbacks, the main
approaches of a content-sensitive adaptable system
were developed, suitable for direct indexing and
search of different types of information in audio
documents.
To introduce the “key-event” method, let us make
three assumptions about the audio signal. The first
one is that the wavelet image of audio contains all
significant information about the time- and spectrum-
based signal features (Kukharchik, Kheidorov et al.,
2008). All features like formants, fundamental
frequency, cepstra, short-time energy
parameters, etc. can be calculated from the wavelet
image of audio, and can be considered as
transformations of the wavelet image. This means that
we suppose that all necessary information (including
time dependences) about audio can be obtained
from its wavelet representation.
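A minimal sketch of computing such a wavelet image with a complex Morlet wavelet is shown below; the wavelet choice, scale grid, and sampling rate are illustrative assumptions, not parameters fixed by the method.

```python
import numpy as np

def morlet(t, scale, w0=6.0):
    """Complex Morlet wavelet evaluated on time axis t at a given scale."""
    x = t / scale
    return np.exp(1j * w0 * x) * np.exp(-0.5 * x**2) / np.sqrt(scale)

def cwt_image(signal, scales, fs):
    """Magnitude of the continuous wavelet transform: the 'wavelet image'
    from which time- and spectrum-based features can be derived."""
    n = len(signal)
    t = (np.arange(n) - n // 2) / fs
    image = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        kernel = morlet(t, s)
        # convolve the signal with the scaled wavelet at each scale
        image[i] = np.abs(np.convolve(signal, kernel, mode="same"))
    return image

fs = 8000
t = np.arange(fs) / fs             # 1 second of audio
sig = np.sin(2 * np.pi * 440 * t)  # a pure 440 Hz tone as a stand-in
scales = np.geomspace(0.0005, 0.01, 16)
img = cwt_image(sig, scales, fs)
print(img.shape)  # (16, 8000): one row per scale, one column per sample
```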
The second assumption is that any audio
fragment can be described as a set of specially
determined “events”, where each event represents an
acoustically significant property of a certain audio
part. An event has to be an important feature that
is typical for a specific type of audio object. In
a broad sense, such acoustical events (we will call
them “key-events”) represent the semantics of the
audio fragment, and can be used for direct description
and indexing of audio. Specially selected “key-
events” can be transformed into indexes suitable for
storing in databases.
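One way to picture a key-event and its transformation into a database index is the sketch below; the field names and index layout are hypothetical, since the text does not fix a schema.

```python
from dataclasses import dataclass

@dataclass
class KeyEvent:
    """One acoustically significant event inside an audio fragment.
    Field names are illustrative assumptions, not a fixed schema."""
    label: str       # e.g. "burst", "vowel-onset", "drum-hit"
    start: float     # seconds from fragment start
    end: float       # seconds from fragment start
    features: tuple  # feature vector derived from the wavelet image

def to_index(events):
    """Turn a set of key-events into a compact, sortable index record
    suitable for storing in a database."""
    return sorted((e.label, round(e.start, 3), round(e.end, 3))
                  for e in events)

events = [KeyEvent("vowel-onset", 0.120, 0.180, (0.4, 1.2)),
          KeyEvent("burst", 0.050, 0.065, (2.1, 0.3))]
print(to_index(events))
# [('burst', 0.05, 0.065), ('vowel-onset', 0.12, 0.18)]
```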
The third assumption is that an SVM classifier can
be trained to model any selected event in the audio
signal with varying accuracy. In other words, it is
supposed that the SVM kernel can approximate any
feature parameter (transformation) of the audio signal.
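As a self-contained illustration of training an SVM-style event detector, the sketch below fits a linear SVM by sub-gradient descent on the hinge loss over toy feature vectors; a production system would use kernel SVMs on real wavelet-derived features, so treat this purely as a sketch.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimal linear SVM trained by sub-gradient descent on the
    hinge loss; labels y must be +1 (event) or -1 (no event)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:      # point violates the margin
                w += lr * (yi * xi - lam * w)
                b += lr * yi
            else:                          # only regularization decay
                w += lr * (-lam * w)
    return w, b

# Toy "event" detector: 2-D feature vectors, linearly separable
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
print(np.sign(X @ w + b))  # matches y on this separable toy set
```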
The developed and proposed “key-event” search
method includes two main stages: content-
dependent “key-event” creation for the target audio, and
target audio search based on its “key-event”
representation. Each target audio has to be found by
looking for the appearance of its key-events in a
continuous audio stream. The “key-event” search
can be done in order of growing computational
complexity, i.e. the simplest “key-event” is searched
for first, then the second one, and so on.
Each subsequent search can be done within the framework
of the previous search results, and the final decision
has to be made by a specially trained decoder (for
example, an HMM). The search scheme is presented in
Figure 1.
Figure 1: Audio search scheme based on key-event
ideology.
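The cascade described above, where each detector scans only the candidate regions kept by the previous, cheaper one, can be sketched as follows; the detector interface is a hypothetical simplification.

```python
def cascade_search(stream_len, detectors):
    """Run key-event detectors in order of growing cost; each detector
    only scans the candidate regions kept by the previous one.
    `detectors` is a list of functions mapping a (start, end) region
    to a list of surviving sub-regions (hypothetical interface)."""
    regions = [(0, stream_len)]          # start with the whole stream
    for detect in detectors:             # cheapest detector first
        regions = [r for region in regions for r in detect(region)]
        if not regions:                  # no candidates left: early exit
            break
    return regions                       # handed to the HMM decoder

# Toy detectors: keep only the left half, then drop short regions
halve = lambda r: [(r[0], (r[0] + r[1]) // 2)]
long_enough = lambda r: [r] if r[1] - r[0] >= 100 else []
print(cascade_search(1000, [halve, long_enough]))  # [(0, 500)]
```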
The main advantage of the proposed approach is
that we try to find the feature vector (which represents
the “key-event”) that is most suitable and typical for the
given content. This lets us describe audio
in terms of features specific to that audio, not by
common features like cepstra and others. Such an
approach exploits a priori audio information
as much as possible, and this information is
represented by “key-events”. The number of “key-
events” can differ for different audio fragments. Of
course, the procedure of “key-event” selection for
each audio is very complicated and time-consuming,
but it can be performed iteratively. The most successful key-
events can be stored in the database as the basis for
new “key-events”. Over time the “key-event”
database will be enlarged and will finally contain
most of the possible events for audio, so that new
target audio can be easily described in terms of
“key-events”. Different “key-events” can be joined
in an arbitrary order to form the semantic description
of the audio.
SIGMAP 2009 - International Conference on Signal Processing and Multimedia Applications