13, and R class of trailers, all labeled as ground-truth true, while the remaining movie rating classes are labeled false. At the classification step, an unlabeled test trailer is assigned to the rating class that produces the largest hyperplane distance in feature space. Implementation-wise, our software controls the one-against-all outer loop, invoking SVM-Light four times, once for each rating-class-vs.-rest configuration.
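The decision rule can be sketched as follows; this is a minimal illustration that assumes the four signed per-class hyperplane distances have already been obtained from the SVM-Light models, and the function name and array layout are ours, not the paper's actual interface:

```cpp
#include <array>
#include <cstddef>
#include <string>

// One-against-all decision step (sketch): given the signed hyperplane
// distances produced by the four per-class SVM models for one test
// trailer, assign the trailer to the class with the largest distance.
static const std::array<std::string, 4> kRatings = {"G", "PG", "PG-13", "R"};

std::string classifyByMaxMargin(const std::array<double, 4>& distances) {
    std::size_t best = 0;
    for (std::size_t c = 1; c < distances.size(); ++c) {
        if (distances[c] > distances[best]) {
            best = c;  // larger hyperplane distance wins
        }
    }
    return kRatings[best];
}
```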
4.2 Ranked Information Retrieval
The Vector Space Model is the most commonly used model in IR. It ranks more relevant documents higher than less relevant ones with respect to a query comprised of a collection of terms. Both the query and the documents are represented as vectors in the term space, and documents are ranked by their proximity to the query. Proximity is a similarity measure of two vectors, roughly a function of the inverse of the distance between them. The convention adopted by the IR community, ranking documents higher the larger the cosine of the angle between the query and document vectors, stems from the cosine function decreasing monotonically on the interval $[0^{\circ}, 180^{\circ}]$: a smaller angle between vectors implies greater similarity.
A vector is length normalized by dividing each of its weighting components by its length. For normalization, IR uses the L2-norm, $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$. Dividing a vector by its L2-norm makes it a unit vector, so short and long trailer sequences of scaled terms now have comparable weights. We define cosine similarity as the dot product of the query soundtrack vector, $\vec{q}$, and a training trailer audio vector, $\vec{m}$, both length normalized:
$$\cos(\vec{q},\vec{m}) = \vec{q}\cdot\vec{m} = \sum_{i=1}^{|V|} q_i\, m_i. \qquad (5)$$
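Equation (5) can be sketched in a few lines; the helper names below are illustrative, and the vectors passed in would be the quantized term-weight histograms described earlier:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// L2-normalize a term-weight vector: divide each component by the
// vector's Euclidean length, yielding a unit vector.
std::vector<double> l2Normalize(const std::vector<double>& x) {
    double norm = 0.0;
    for (double v : x) norm += v * v;
    norm = std::sqrt(norm);
    std::vector<double> unit(x.size(), 0.0);
    if (norm > 0.0) {
        for (std::size_t i = 0; i < x.size(); ++i) unit[i] = x[i] / norm;
    }
    return unit;
}

// Equation (5): cosine similarity as the dot product of the two
// length-normalized vectors.
double cosineSimilarity(const std::vector<double>& q,
                        const std::vector<double>& m) {
    const std::vector<double> qn = l2Normalize(q);
    const std::vector<double> mn = l2Normalize(m);
    double dot = 0.0;
    for (std::size_t i = 0; i < qn.size(); ++i) dot += qn[i] * mn[i];
    return dot;  // 1 for identical direction, 0 for orthogonal vectors
}
```

Note that normalizing first makes the plain dot product equal the cosine of the angle, which is why vectors of very different lengths become directly comparable.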
We compute the cosine similarity score between the query trailer and each of the training trailers in our dataset. Training trailers are ranked for relevance to the query by their score, and the top M (M = 10) are returned for further analysis.
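The ranking step amounts to a sort by score; a minimal sketch, assuming scores have already been computed per Equation (5) and using illustrative trailer identifiers:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Retrieval sketch: given (trailer id, cosine score) pairs for every
// training trailer, return the M highest-scoring trailer ids.
std::vector<std::string> topM(std::vector<std::pair<std::string, double>> scored,
                              std::size_t m = 10) {
    std::sort(scored.begin(), scored.end(),
              [](const std::pair<std::string, double>& a,
                 const std::pair<std::string, double>& b) {
                  return a.second > b.second;  // higher cosine ranks first
              });
    if (scored.size() > m) scored.resize(m);
    std::vector<std::string> ranked;
    for (const auto& entry : scored) ranked.push_back(entry.first);
    return ranked;
}
```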
5 EMPIRICAL EVALUATION
To evaluate our system in practice, we have implemented a Direct2D audio application that reads raw WAV files, splits each into a property header and data sections, and loads the normalized time signal and its sampling-rate parameter into our movie rating C++ library. Our library operates on the raw audio, performs feature extraction followed by vector quantization, and carries out the discrimination and similarity calculations. We use the hold-out method with cross validation to rank the performance of our system. Formally, our library sets up either a random or a 10-fold resampling mode, and each rating class becomes a two-way data split of trailers, with train and test sets owning 80 and 20 percent shares, respectively.
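The per-class 80/20 hold-out split in random-resampling mode can be sketched as follows; the function name, integer trailer ids, and seed parameter are illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hold-out split sketch: shuffle one rating class's trailer ids, then
// carve out an 80% train share and a 20% test share.
std::pair<std::vector<int>, std::vector<int>>
holdOutSplit(std::vector<int> trailerIds, unsigned seed) {
    std::mt19937 rng(seed);
    std::shuffle(trailerIds.begin(), trailerIds.end(), rng);
    const std::size_t trainCount = (trailerIds.size() * 8) / 10;  // 80% share
    std::vector<int> train(trailerIds.begin(),
                           trailerIds.begin() + trainCount);
    std::vector<int> test(trailerIds.begin() + trainCount, trailerIds.end());
    return {train, test};
}
```

Repeating the split with fresh seeds yields the random resampling runs; partitioning the shuffled ids into ten equal buckets instead would give the 10-fold variant.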
5.1 Experimental Setup
Aimed at matching movie content to audience age, we assess the effectiveness of the bag-of-audio-words representation. We build a labeled base set of 25 trailers for each of the G, PG, PG-13, and R ratings. The base set is drawn randomly from previously rated movie trailers on IMDb (IMDb, 1990). Our collection incorporates recent and modern titles that are readily accessible online, but also subscribes to a fair share of productions spanning over two decades of movie making. Each high-quality, 16-bit mono, 44.1 kHz WAV file, produced from a minute-long recording, is about 5 MB in size, setting the base set at a total of 100 minutes of footage and a combined 0.5 GB. With an average of nearly 12,000 MFCC feature vectors extracted from the audio sequence of a base trailer, our system processes a total of 1.2 million features.
We then apply artificial signal distortion to each labeled recording and augment our learning data set by a factor of ten, yielding an aggregate of 250 auditory samples per rating class and a combined 1,000-trailer system set that subscribes to 5 GB of physical and virtual audio footage. Rather than deforming the source signal (Riley et al., 2008), a computationally intensive step, we obtain an identical effect by simply perturbing the histogram of the base word vector, randomly modulating the term frequency of each word within the interval [−5%, +5%]. This slightly warped version of a word vector conforms to both the discriminatory and similarity margins that rule rating-class inclusion.
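The perturbation step can be sketched as below; the function name and seed parameter are illustrative, and the input is the base word vector's term-frequency histogram:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Augmentation sketch: modulate each term frequency of a base word-vector
// histogram by a random factor in [-5%, +5%], producing one synthetic
// variant per call.
std::vector<double> perturbHistogram(const std::vector<double>& baseTf,
                                     unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> jitter(-0.05, 0.05);
    std::vector<double> warped(baseTf.size());
    for (std::size_t i = 0; i < baseTf.size(); ++i) {
        warped[i] = baseTf[i] * (1.0 + jitter(rng));  // within +/-5%
    }
    return warped;
}
```

Calling this ten times with distinct seeds per base trailer yields the ten-fold augmentation described above, at a small fraction of the cost of re-extracting features from a distorted signal.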
Scalability to a dynamically growing, synthetic auditory data set, in pursuit of higher classification performance, is a guiding principle in the design of our proposed system. Here we discuss three implementation considerations that ensure the efficiency of the major computational steps. First, the task of processing over a million feature vectors is a critical compute section in our implementation. We find the execution of feature extraction and vector quantization, iterating the base set sequentially, prohibitively
inefficient. Rather, we exploit concurrency, using
the latest C++11 futures and asynchronous launching
methodology. In parallel processing independent au-
SIGMAP 2013 - International Conference on Signal Processing and Multimedia Applications