KNOWLEDGE AND CONTENT-BASED AUDIO RETRIEVAL USING
WORDNET
Pedro Cano and Markus Koppenberger and Sylvain Le Groux
and Perfecto Herrera and Julien Ricard and Nicolas Wack
Institut de l’Audiovisual, Universitat Pompeu Fabra
c/Ocata 3, 08003 Barcelona, Spain
Keywords:
MPEG7, content-based audio, sound effects, ontology management, WordNet, audio classification, audio asset
management
Abstract:
Sound producers create the sound that goes along with the image in cinema and video productions, as well as in spots and documentaries. Some sounds are recorded for the occasion. Many occasions, however, require
the engineer to have access to massive libraries of music and sound effects. Of the three major facets of
audio in post-production: music, speech and sound effects, this document focuses on sound effects (Sound
FX or SFX). The main professional on-line sound-FX providers offer their collections using standard text-retrieval technologies. Library construction is an error-prone and labor-intensive task. Moreover, the ambiguity and
informality of natural languages affects the quality of the search. The use of ontologies alleviates some of
the ambiguity problems inherent to natural languages, yet it is very complicated to devise and maintain an
ontology that accounts for the level of detail needed in a production-size sound effect management system.
To address this problem we use WordNet, an ontology that organizes over 100,000 concepts of real-world knowledge: e.g., it relates doors to locks, to wood and to the actions of opening, closing or knocking. However, a fundamental issue remains: sounds without a caption are invisible to the users. Content-based audio tools offer
perceptual ways of navigating the audio collections, like “find similar sound”, even if unlabeled, or query-by-
example, possibly restricting the search to a semantic subspace, such as “vehicles”. The proposed content-
based technologies also allow semi-automatic sound annotation. We describe the integration of semantically-
enhanced management of metadata using WordNet together with content-based methods in a commercial
sound effect management system.
1 INTRODUCTION
The audio component is a fundamental aspect in an
audiovisual production. Around 75% of the sound ef-
fects (SFX) of a movie are added at post-production.
Many sounds are useless due to the noise in the
recording session and some are simply not picked up
by the production microphones. Sometimes sounds
are replaced in order to improve the dramatic impact,
e.g., the arrow sounds of “The Lord of the Rings” are replaced by “whooshes”. There are also artistic reasons: for example, in the movie “All the President’s Men”, in order to strengthen the message that the pen is mightier than the sword, the typewriter key sounds were mixed with the sound of gunfire (Weis, 1995).
Many occasions besides movies, such as computer games, audio-visual presentations and web sites, require sounds. These sounds can be recorded or recreated using foley techniques—for the sound of the knife entering the body in Psycho’s shower scene, Hitchcock used a melon (Weis, 1995). Another possibility is the use of already compiled SFX libraries. Accessing library sounds can be an interesting alternative to sending a team to record sounds or recreating them in a studio, which requires a foley pit and the rest of the recording equipment.
A number of SFX providers, for example:
www.sounddogs.com, www.sonomic.com or
www.sound-effects-library.com, offer SFX on-
line. The technology behind these services is
standard text-search. Librarians tag sounds with de-
scriptive keywords that the users may search. Some
companies also keep categories—such as “automo-
biles”, “horror” or “crashes”—to ease the interaction
with the collections. This approach presents several
limitations. The work of the librarian is an error-prone and very time-consuming task. Solutions have been proposed to manage media assets from an audio
content-based perspective, both from academia and industry (see Section 2). However, none seems to have had an impact on professional sound effects management systems. Another source of
problems is the imprecision and ambiguity of natural languages. Natural languages present polysemy—“bike” can mean both “bicycle” and “motorcycle”—and synonymy—both “elevator” and “lift” refer to the same concept. This, together with the difficulty associated with describing sounds with words, affects the quality of the search. The user has
to guess how the librarian has labeled the sounds and
either too many or too few results are returned.
In this context we present a SFX retrieval system
that incorporates content-based audio techniques and
semantic knowledge tools, implemented on top of the database of one of the biggest sound effects providers. The rest of the paper is organized as follows: in Section 2 we review what the existing literature proposes to improve sound effect management. Sections 3 to 5 describe the implemented enhancements of
the system.
2 RELATED WORK
Work related to sound effect management falls into three categories: content-based audio technologies, approaches to describing sound events, and taxonomy management.
2.1 Content-based audio
classification and retrieval
Content-based functionalities aim at finding new ways of querying and browsing audio documents, as well as at automatically generating metadata, mainly via classification. Query-by-example and similarity measures that allow perceptual browsing of an audio collection are addressed in the literature and exist in commercial products; see for instance www.findsounds.com or www.soundfisher.com.
Existing classification methods normally concen-
trate on small domains, such as musical instrument
classification or very simplified sound effects tax-
onomies. Classification methods cannot currently offer the detail needed in commercial sound effects management, e.g., “female steps on wood, fast”. In audio classification, researchers normally assume the existence of, or define, a well-defined hierarchical classification scheme with a few categories (fewer than a hundred at the leaves of the tree). On-line sound effects and
music sample providers have several thousand cate-
gories. For further discussion on classification of gen-
eral sound, we refer to (Cano et al., 2004a).
2.2 Description of Audio
Sounds are multifaceted, multirepresentational and usu-
ally difficult to describe in words. MPEG-7 offers
a framework for the description of multimedia doc-
uments, see (Manjunath et al., 2002). MPEG-7 con-
tent semantic description tools describe the actions,
objects and context of a scene. In sound effects, this correlates to the physical production of the sound in the real world, e.g., “moo cow solo”, or to its context, e.g., “Airport atmos announcer”.
MPEG-7 content structure tools concentrate on the
spatial, temporal and media source structure of mul-
timedia content. Indeed, important descriptors are those that describe the perceptual qualities independently of the source and of how sounds are structured in a mix, since they refer to intrinsic properties of sound, e.g., loudness or brightness. Other important searchable metadata are post-production-specific descriptions, e.g., horror, comic or science-fiction. Creation
metadata describe how the sound was recorded. For
example, to record a car door closing one can place
the microphone in the interior or in the exterior. Some
examples of such descriptors are: interior, exterior,
close-up, live recording, programmed sound, studio
sound, treated sound. For a more complete review on
SFX description, we refer to (Cano et al., 2004b).
Figure 1: Snapshot of the vehicle taxonomy in WordNet
2.0. Only the hypernym type of relation is displayed.
2.3 Taxonomy Management
The use of taxonomies or classification schemes al-
leviates some of the ambiguity problems inherent
to natural languages, yet they pose others. It is
very complicated to devise and maintain classification
schemes that account for the level of detail needed in
a production-size sound effect management system.
The MPEG-7 standard provides description mecha-
nisms and ontology management tools for multime-
dia documents (Manjunath et al., 2002). Celma and Mieza (2004) present an opera information system that exploits taxonomies built with classification schemes using the MPEG-7 framework. However, it is very complicated to extend them to the level of detail needed in a production-size sound effect management system.
We have found that it is much faster to start developing ontologies on top of a semantic network such as WordNet than to start from scratch.
WordNet (Miller, 1995) (http://www.cogsci.princeton.edu/~wn/) is
a lexical network designed following psycholin-
guistic theories of human lexical memory. Standard
dictionaries organize words alphabetically. WordNet
organizes concepts in synonym sets, synsets, with
links between the concepts like: broader sense,
narrower sense, part of, made of and so on. It knows
for instance that the word “piano” has two senses,
the musical attribute that refers to “low loudness”
and the “musical instrument”. It also encodes the
information that a grand piano is a type of piano,
and that it has parts such as a keyboard, a loud
pedal and so on. Such a knowledge system is
useful for retrieval. It can for instance display the
results of a query for “car” grouped into types of cars, parts of a car, and actions of a car (approaching, departing, turning
off). Figure 1 displays a subset of the “vehicle”
graph from WordNet 2.0. The usefulness of such
knowledge systems has been justified for image
retrieval in (Aslandogan et al., 1997) and in general
multimedia asset management (Flank, 2002).
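To make these relations concrete, the short Python sketch below explores WordNet through the NLTK corpus reader; it is only an illustration of the kind of expansion described above (the synset names and printed fields are NLTK conventions, not part of the system described in this paper).

```python
# Minimal sketch of the WordNet relations discussed above, using the NLTK
# corpus reader (requires: pip install nltk; then nltk.download('wordnet')).
# This is an illustration, not the ontology back-end of the described system.
from nltk.corpus import wordnet as wn

# "piano" has several senses (synsets), including the instrument and the
# musical dynamic marking ("low loudness").
for synset in wn.synsets("piano"):
    print(synset.name(), "-", synset.definition())

# Expand the query "car" as the text suggests: types of cars (hyponyms)
# and parts of a car (part meronyms).
car = wn.synset("car.n.01")
print("types:", [s.lemma_names()[0] for s in car.hyponyms()][:10])
print("parts:", [s.lemma_names()[0] for s in car.part_meronyms()][:10])
```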
3 SYSTEM OVERVIEW
Text-based and content-based methods alone do not
seem to suffice for a complete interaction with vast
sound effects repositories. In the implemented sys-
tem we aim to combine the best of two worlds to offer
tools for the users to refine and explore a huge collec-
tion of audio. Similar work on integrating perceptual
and semantic information in a more general multi-
media framework is MediaNet (Benitez et al., 2000).
The system we present is specialized for SFX. The
current prototype uses 80,000 sounds from a major
on-line sound effects provider: www.sound-effects-
library.com. Sounds come with a textual description, which has been disambiguated using the augmented WordNet ontology.
3.1 Functional blocks
The system has been designed to ease the use of dif-
ferent tools to interact with the audio collection and
with speed as a major design issue. On top of these
premises we have implemented the blocks of Fig. 2.
The sound analysis, audio retrieval and metadata gen-
eration blocks are described in Section 4. The text re-
trieval, text processor and knowledge manager blocks
are described in Section 5.
Figure 2: Functional Block Diagram of the System.
3.2 System architecture
The audio processing engines use the CLAM Frame-
work, a C++ Library for Audio and Music, devel-
oped at the MTG and distributed under GNU/GPL
License (http://www.iua.upf.es/mtg/clam). The on-
tology management and integration of different parts
is done with Perl and a standard relational database
management system. The functionality is avail-
able via a web interface and exported via SOAP
(http://www.w3.org/TR/soap). The SOAP interface
provides some exclusive functionality—such as inter-
action with special applications, e.g., sound editors and annotators—which is not available via the web
interface. See Figure 3 for a diagram of the architec-
ture.
4 CONTENT-BASED AUDIO
TOOLS
Content-based audio tools ease the work of the librar-
ian and enhance the search possibilities for the user.
They simplify the labeling of new sounds because many keywords are automatically presented to the librarian. To achieve this, a new sound is compared to the collection with a nearest-neighbor search and the text associated with the most similar matches is presented
to the librarian. The sound analysis module (see Fig-
ure 2), besides extracting sound descriptors used for
the similarity search, generates searchable descriptors
Figure 3: System architecture. Web and SOAP clients (e.g., Audacity or ProTools plug-ins) access an application server that offers browsing, search tools (similarity, query-by-example), metadata extraction, data and ontology management, user management and usage statistics, on top of a relational DBMS for metadata and a file system for audio data and additional metadata (e.g., XML files).
such as those detailed in Subsection 4.2 (crescendo, noisy,
etc.). Content-based tools offer the user functionali-
ties such as:
Virtual Foley Mode: Find perceptually similar
sounds. A user may be interested in a glass crash
sound. If none of the retrieved sounds suits him, he
can still browse the collection for similar sounds
even if produced by different sources, even if unla-
beled.
Clustering of sounds: Typically a query like
“whoosh” may retrieve several hundred results.
These results are clustered and only one represen-
tative of each class is displayed to the user. The
user can then refine the search more easily.
Morphological Descriptors: Another option when
the list of results is too large to listen to is filter-
ing the results using morphological descriptors (see
Section 4.2).
Query by example: The user can provide an example sound, or utter one himself, as a query to the system,
possibly restricting the search to a semantic sub-
space, such as “mammals”.
4.1 Similarity Distance
The similarity measure is a normalized Manhat-
tan distance of features belonging to three different
groups: a first group gathering spectral as well as tem-
poral descriptors included in the MPEG-7 standard; a
second one built on Bark Bands perceptual division
of the acoustic spectrum and which outputs the mean
and variance of relative energies for each band; and,
finally a third one, composed of Mel-Frequency Cep-
stral Coefficients and their corresponding variances
(see (Cano et al., 2004a) for details):
d(x, y) = \sum_{k=1}^{N} \frac{|x_k - y_k|}{\max_k - \min_k}

where x and y are the vectors of features, N is the dimensionality of the feature space, and \max_k and \min_k are the maximum and minimum values of the k-th feature.
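A minimal sketch of this distance in Python (function and argument names are placeholders, not part of the described system) could look as follows.

```python
import numpy as np

def normalized_manhattan(x, y, feat_min, feat_max):
    """Range-normalized Manhattan distance between two feature vectors,
    a direct transcription of the formula above. All inputs are length-N arrays."""
    span = np.maximum(np.asarray(feat_max) - np.asarray(feat_min), 1e-12)
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) / span))
```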
The similarity measure is used for metadata genera-
tion: a sound sample will be labeled with the descrip-
tions from the similar sounding examples of the anno-
tated database. This type of classification is known as the one-nearest-neighbor decision rule (1-NN) (Jain et al., 2000). The choice of a memory-based nearest-neighbor classifier avoids designing and training a model for every possible class of sound, of which there are several thousand. Besides, it does not need redesign or
training whenever a new class of sounds is added to
the system. The NN classifier needs a database of la-
beled instances and a similarity distance to compare
them. An unknown sample will borrow the metadata
associated with the most similar registered sample.
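A sketch of this memory-based labeling step, assuming a hypothetical caption store and pre-extracted feature vectors, might look like this.

```python
import numpy as np
from scipy.spatial.distance import cdist

def propagate_metadata(query_feats, db_feats, db_captions):
    """Assign to an unlabeled sound the caption of its nearest neighbor (1-NN).

    db_feats: (M, N) feature matrix of the annotated collection.
    query_feats: (N,) feature vector of the new, unlabeled sound.
    db_captions: list of M caption strings (hypothetical metadata store).
    """
    lo, hi = db_feats.min(axis=0), db_feats.max(axis=0)
    span = np.maximum(hi - lo, 1e-12)
    # Range-normalize so a plain Manhattan distance matches the formula above.
    dists = cdist((query_feats / span)[None, :], db_feats / span,
                  metric="cityblock")[0]
    return db_captions[int(np.argmin(dists))]
```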
The similarity measure is also used for the query-
by-example and to browse through “perceptually”
generated hyperlinks.
4.2 Morphological Sound
Description
The morphological sound description module extracts a set of descriptors focused on intrinsic perceptual qualities of sound, based on Schaeffer’s research on sound objects (Schaeffer, 1966). The extractor of
morphological descriptors (Ricard and Herrera, 2004)
currently generates the following metadata:
Pitchness: (Organization within the spectral dimen-
sions) Pitch, Complex and Noisy.
Dynamic Profile: (Intensity description) Unvarying,
Crescendo, Decrescendo, Delta, Impulsive, Itera-
tive, Other.
Pitchness Profile: (Temporal evolution of the inter-
nal spectral components) Varying and Unvarying.
Pitch Profile: (Temporal evolution of the global
spectrum) Undefined, Unvarying, Varying Contin-
uous, Varying Stepped.
Figure 4: Morphological descriptor filtering. The iterative dynamic profile allows discrimination between snare samples and loops.
Figure 5: Morphological description filtering. The impulsive dynamic profile allows violin pizzicati to be discriminated.
These descriptors can be used to retrieve abstract
sounds as well as to refine other types of searches. Besides applying to all types of sounds, the use of an automatic extractor avoids expensive human labeling while ensuring consistency. For details on the con-
struction and usability evaluation of the morphologi-
cal sound description we refer to (Ricard and Herrera,
2004).
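For illustration, filtering a result list by these descriptors could be as simple as the sketch below; the record layout and descriptor values are assumptions, not the system's actual data model.

```python
# Hypothetical result records carrying automatically extracted morphological
# descriptors; the field names and values are illustrative only.
results = [
    {"caption": "snare hit",   "dynamic_profile": "impulsive"},
    {"caption": "snare loop",  "dynamic_profile": "iterative"},
    {"caption": "violin pizz", "dynamic_profile": "impulsive"},
]

def filter_by_descriptor(records, name, value):
    """Keep only the records whose morphological descriptor matches."""
    return [r for r in records if r.get(name) == value]

# e.g. discriminate loops from single snare hits, as in Figure 4.
print(filter_by_descriptor(results, "dynamic_profile", "iterative"))
```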
4.3 Clustering and Visualization
Tools
Usually, systems for content-based retrieval of sim-
ilar sounds output a list of similar sounds ordered
by increasing similarity distance. The list of re-
trieved sounds can rapidly grow and the search for the appropriate sound becomes tedious. There is a
need for a user-friendly type of interface for brows-
ing through similar sounds. One possibility for avoiding having to go over, say, 400 gunshots, is to cluster sounds into perceptually meaningful subsets,
so that the user can choose what perceptual cate-
gory of sound he or she wishes to explore. We
used hierarchical tree clustering with the average-linkage algorithm and the above-mentioned similarity distance (Jain et al., 2000). Another possibility for in-
teraction with the sounds is using visualization tech-
niques, specifically Multidimensional scaling (MDS),
self-organizing maps (SOM) or FastMap (Cano et al.,
2002), to map the audio samples into points in a Euclidean space. Figure 6 displays a mapping of the au-
dio samples to a 2D space. In the example it is possi-
ble to distinguish different classes of cat sounds, e.g.:
“purring”, “hissing” and “miaow” sounds.
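A minimal sketch of both interaction modes, using SciPy's average-linkage clustering and scikit-learn's MDS as stand-ins for the system's own implementation, is given below (the feature matrix is placeholder data).

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.manifold import MDS

# feats: (M, N) per-sound feature vectors, already range-normalized so that
# 'cityblock' (Manhattan) matches the similarity distance of Subsection 4.1.
feats = np.random.rand(200, 20)          # placeholder data

condensed = pdist(feats, metric="cityblock")

# Hierarchical clustering with average linkage; keep, say, 8 clusters and
# show one representative per cluster to the user.
Z = linkage(condensed, method="average")
labels = fcluster(Z, t=8, criterion="maxclust")
representatives = [np.where(labels == c)[0][0] for c in np.unique(labels)]

# Map sounds to a 2-D space for browsing (MDS on the same distances).
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(squareform(condensed))
print(representatives, coords.shape)
```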
5 NATURAL LANGUAGE
PROCESSING AND
KNOWLEDGE MANAGER
This module enhances existing text-search engines
used in sound effects retrieval systems. It eases the librarian’s work and simplifies the management of the categories. Its main components are:
A lemmatizer, e.g., “bikes” becomes “bike”; an inflecter that expands it to “bike, bikes and biking”; and a named-entity recognition module that is able to identify “Grand piano” as a specific type of piano (a sketch of these text-processing steps follows this list).
A phonetic matching module, e.g., “whoooassh” retrieves “whoosh”. Phonetic matching is used in information retrieval to account for typing errors in a query.
Higher control over the precision and recall of the results using WordNet concepts. The query “bike”
returns both “bicycle” and “motorcycle” sounds
and the user is given the option to refine the search.
Proposal of related terms. It is generally accepted
that recognition is stronger than recall. A user may
not know how the librarian tagged a sound. Word-
Net can be used to propose alternative search terms.
Proposal of higher-level related terms not included in the lexical network. WordNet does not have all possible relations. For instance, “footsteps in mud”, “tractor”, “cow bells” and “hens” may seem related in our minds when we think of farm sounds but do not have direct links within WordNet. It is possible to recover this type of relation be-
cause there are many sounds that have been la-
beled with the concept “farm”. Studying the co-
occurrence of synsets allows the system to infer re-
lated terms (Banerjee and Pedersen, 2003).
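A minimal sketch of the lemmatization, synonym-expansion and fuzzy-matching steps listed above, built from off-the-shelf components (NLTK and difflib) rather than the modules of the actual system:

```python
# Illustrative only: NLTK's WordNet interface and difflib stand in for the
# system's own lemmatizer, inflecter and phonetic-matching modules.
import difflib
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("bikes"))          # -> "bike"

# Expand a query term with its WordNet synonyms, so that "lift" also
# reaches sounds labeled "elevator".
def expand(term):
    return sorted({lemma for s in wn.synsets(term) for lemma in s.lemma_names()})

print(expand("lift")[:10])

# Stand-in for phonetic matching: map a misspelled query onto the closest
# terms of a (hypothetical) indexing vocabulary.
vocabulary = ["whoosh", "swoosh", "whistle", "wash"]
print(difflib.get_close_matches("whoooassh", vocabulary, n=1, cutoff=0.5))
```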
For further details on the implementation and evalua-
tion of WordNet as the backbone for sound effect ontol-
ogy management, we refer to (Cano et al., 2004b).
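As an illustration of the co-occurrence idea, the toy sketch below counts which tags appear together with “farm” across captions; the tag sets are invented and stand in for the disambiguated synset identifiers of the real collection.

```python
from collections import Counter

# Toy captions already mapped to synset-like tags; in the real system the
# tags are disambiguated WordNet synset identifiers.
tagged_sounds = [
    {"farm", "tractor", "engine"},
    {"farm", "cow_bell", "cattle"},
    {"farm", "hen", "bird"},
    {"city", "traffic", "engine"},
]

def related_terms(concept, collection, top=5):
    """Rank concepts by how often they co-occur with `concept`."""
    counts = Counter()
    for tags in collection:
        if concept in tags:
            counts.update(tags - {concept})
    return counts.most_common(top)

print(related_terms("farm", tagged_sounds))
```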
Figure 6: FastMap visualization screenshot. The points of the 2D map refer to different audio samples. The distances in the Euclidean space try to preserve the distances in the high-dimensional perceptual space defined by the similarity distance of Subsection 4.1.
6 EXPERIMENTS
We have used 40,000 sounds from the Sound-Effects-
Library (http://www.sound-effects-library.com) for
the experiments. These sounds have been unambigu-
ously tagged with concepts of an enhanced Word-
Net. Thus a violin sound with the following cap-
tion:“Concert Grand Piano - piano” may have the fol-
lowing synsets (the numbers on the left are the unique
WordNet synset identifiers):
02974665%n concert grand, concert piano (a
grand piano suitable for concert performances)
04729552%n piano, pianissimo ((music) low
loudness)
6.1 Experimental setup
The evaluation of similarity distances is a tricky sub-
ject. Perceptual listening tests are expensive. Another
possibility is to evaluate the goodness of the similarity measure by examining its performance in a nearest-neighbor (NN) classification task. As we will see in
6.3, the overlap between semantic and perceptual tax-
onomies complicates the evaluation. In musical in-
struments, the semantic taxonomy more or less fol-
lows an acoustic classification scheme, basically due
to the physical construction, and so instruments are grouped into wind (wood and brass), string (plucked or bowed) and so on (Herrera et al., 2003).
We have tried three ways of assessing the percep-
tual similarity between sounds:
Perceptual listening experiments
Classification or metadata generation performance
Consistency of the ranking and robustness to distortions such as resampling and transcoding (converting to MP3 format at different compression rates and back). The harmonic instruments were transcoded and resampled into WAV PCM and Ogg formats (www.vorbis.com). The classifica-
tion accuracy dropped from 92% to 71.5% in the
worst case.
6.2 Perceptual listening tests
In order to test the relevance of our similarity mea-
sures, we asked users of our system to give a per-
sonal perceptual evaluation of the retrieval of sounds by similarity. This experiment was carried out with 20 users, who chose 41 different queries and produced 568 evaluations of the relevance of the retrieved similar sounds. During the evaluation, the users
were presented with a grading scale from 1—not sim-
ilar at all—to 5—closely similar. The average grade
Table 1: The classifier assigns the metadata of the sounds in the second column to the sounds in the first.
Query Sound Caption | Nearest-neighbor Caption
1275cc Mini Cooper Door Closes Interior Perspective | Trabant Car Door Close
Waterfall Medium Constant | Extremely Heavy Rain Storm Short Loop
M-domestic Cat- Harsh Meow | A1v:Solo violin (looped)
Auto Pull Up Shut Off Oldsmobile Cutlass | Ferrari - Hard Take Off Away - Fast
was 2.6. We have at our disposal the semantic con-
cepts associated with the 40,000 sounds used in the experiment. It turned out that the semantic class of a sound is crucial to the user’s sensation of similarity. The conclusion of our experiment is that users gave better grades to retrieved sounds from the same semantic class as the query sound (40% of the best-graded sounds belonged to the same se-
mantic class). In the prototype, in addition to the
purely content-based retrieval tools, the use of the
knowledge-based tools allows searches for similar
sounds inside a specific semantic family.
6.3 Metadata annotation
performance
The first experiment on metadata annotation perfor-
mance consisted of finding the best match for all the sounds in the database. Table 1 shows some examples: in the left column, the original caption of the query sound, and in the right, the caption of its nearest neighbor. The caption on the right would be assigned
neighbor. The caption on the right would be assigned
to the query sound in an automatic annotation system.
As can be inferred from Table 1, it is difficult to quantitatively evaluate the performance of the system. An intersection of the terms of the captions would not yield a reasonable evaluation metric. The WordNet-based ontology can inform us that both “Trabant” and “Mini Cooper” are narrower terms for the concept “car, automobile”. Thus, the number of synsets common to both the query and its nearest neighbor could be used as a better evaluation measure. The sounds in the database and their best matches had at least one concept (synset) in common half of the time. Yet there
are many cases when this metric is not appropriate
for similarity evaluation. The intersection of source
descriptions can be zero for very similar-sounding sounds. The closest match for a “paper bag” sound turns out to be “eating toast”. These sounds are semantically
different but perceptually equivalent. The ambiguity
is a disadvantage when designing and assessing per-
ceptual similarity distances.
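A sketch of this overlap-based check, with hypothetical synset sets standing in for the disambiguated identifiers attached to each caption:

```python
def synset_overlap(query_synsets, neighbor_synsets):
    """Number of WordNet concepts shared by a query caption and the caption
    of its nearest neighbor; used here as a rough evaluation signal."""
    return len(set(query_synsets) & set(neighbor_synsets))

# "Trabant car door close" vs "Mini Cooper door closes": both captions map
# to the shared concepts car/door/close even though the literal words differ.
print(synset_overlap({"car.n.01", "door.n.01", "close.v.01", "trabant"},
                     {"car.n.01", "door.n.01", "close.v.01", "mini.n.01"}))
```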
In a second experiment we tested the general approach in a reduced-domain classification regime (percussive instruments and harmonic instruments) and achieved acceptable performance. The as-
sumption is that there is a parallelism between se-
mantic and perceptual taxonomies in musical instru-
ments. The psychoacoustic studies of Lakatos (2000)
revealed groupings based on the similarities in the
physical structure of instruments. We have therefore
evaluated the similarity with classification on the mu-
sical instruments space, a subspace of the universe of
sounds.
In the 6-class percussive instrument classification we achieve 85% recognition (955 audio files) using 10-fold cross-validation. The result for an 8-class classification of harmonic instruments is 77.3% (261 audio
files). We refer to (Cano et al., 2004a) for further dis-
cussion on general sound similarity and classification.
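For reference, this kind of reduced-domain experiment can be reproduced with a stock 1-NN classifier and 10-fold cross-validation; the sketch below uses scikit-learn with placeholder features and labels, and a plain Manhattan metric under the assumption of range-normalized features as in Subsection 4.1.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the 955 percussive-instrument samples
# (6 classes); the real experiment uses the features of Subsection 4.1.
X = np.random.rand(955, 20)               # range-normalized feature vectors
y = np.random.randint(0, 6, size=955)     # class labels

clf = KNeighborsClassifier(n_neighbors=1, metric="manhattan")
scores = cross_val_score(clf, X, y, cv=10)
print(f"10-fold accuracy: {scores.mean():.3f}")
```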
7 SUMMARY
We have introduced the difficulties inherent in inter-
acting with sound effect repositories, both for the li-
brarian who designs such content repositories and for
potential users who access this content. We have
presented several technologies that enhance and fit
smoothly into professional sound effects providers
working processes. Several content-based audio tools
have been integrated, providing the possibility of accessing sounds that are unrelated in their text captions but sound the same—even if they are unlabeled. Sev-
eral natural language processing tools have also been
described. WordNet, previously proposed for multimedia retrieval, has been extended for sound effects
retrieval.
The system can be accessed and evaluated at
http://www.audioclas.org.
ACKNOWLEDGEMENTS
We thank the staff from the Tape Gallery for all the
support, discussion and feedback. This work is de-
veloped under a EU E! 2668 Eureka AudioClas. We
thank Oscar Celma and Fabien Gouyon for the review.
REFERENCES
Aslandogan, Y. A., Thier, C., Yu, C. T., Zou, J., and Rishe, N. (1997). Using semantic contents and WordNet in image retrieval. In Proc. of the SIGIR, Philadelphia, PA.
Banerjee, S. and Pedersen, T. (2003). The design, imple-
mentation, and use of the Ngram Statistic Package.
In Proceedings of the Fourth International Conference
on Intelligent Text Processing and Computational Lin-
guistics, Mexico City.
Benitez, A. B., Smith, J. R., and Chang, S.-F. (2000). Me-
dianet: A multimedia information network for knowl-
edge representation. In Proceedings of the SPIE 2000
Conference on Internet Multimedia Management Sys-
tems, volume 4210.
Cano, P., Kaltenbrunner, M., Gouyon, F., and Batlle, E.
(2002). On the use of FastMap for audio information
retrieval. In Proceedings of the International Sympo-
sium on Music Information Retrieval, Paris, France.
Cano, P., Koppenberger, M., Groux, S. L., Ricard, J., Her-
rera, P., and Wack, N. (2004a). Nearest-neighbor
generic sound classification with a wordnet-based tax-
onomy. In Proc.116th AES Convention, Berlin, Ger-
many.
Cano, P., Koppenberger, M., Herrera, P., and Celma, O.
(2004b). Sound effects taxonomy management in pro-
duction environments. In Proc. AES 25th Int. Conf.,
London, UK.
Celma, O. and Mieza, E. (2004). An opera information sys-
tem based on MPEG-7. In Proc. AES 25th Int. Conf.,
London, UK.
Flank, S. (2002). Multimedia technology in context. IEEE Multimedia, pages 12–17.
Herrera, P., Peeters, G., and Dubnov, S. (2003). Automatic
classification of musical instrument sounds. Journal
of New Music Research, 32(1).
Jain, A. K., Duin, R. P., and Mao, J. (2000). Statistical
pattern recognition: A review. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22(1):4–
37.
Lakatos, S. (2000). A common perceptual space for harmonic and percussive timbres. Perception & Psychophysics, 62:1426–1439.
Manjunath, B. S., Salembier, P., and Sikora, T. (2002). Introduction to MPEG-7: Multimedia Content Description Interface. John Wiley & Sons, Ltd.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11):39–45.
Ricard, J. and Herrera, P. (2004). Morphological sound de-
scription: Computational model and usability evalua-
tion. In Proc.116th AES Convention, Berlin, Germany.
Schaeffer, P. (1966). Traité des Objets Musicaux. Éditions du Seuil.
Weis, E. (1995). Sync tanks: The art and technique of post-
production sound. Cineaste, 21(1):56.