device such as an iPhone or iPad, that performs image analysis and reproduces student-defined musical textures or planes. This tool was used both with secondary school students, in a science summer school, and with higher education students, in the master's in Teaching of Musical Education in the Basic School. It is through experimentation and careful reflection that students learn the principles underlying sonification and can explore art installations, interactive performances, and other formats.
2 MACHINE LEARNING
Digital image processing has been a challenge for several decades. Nevertheless, the importance of this field has pushed research towards significant advances throughout the years. Both technological advances and the adoption of new paradigms have made image processing faster, more precise, and more useful. Several problems are usually addressed through digital image processing (Egmont-Petersen et al., 2002):
1. Preprocessing/filtering - the output is of the same type as the input, with the objective of improving contrast or reducing noise;
2. Data reduction/feature extraction - the resulting data are usually smaller and indicative of significant components in the image;
3. Segmentation - partitioning and selecting regions in the image according to some criterion;
4. Object detection and recognition - determining the location, orientation, and classification of specific objects in the image;
5. Image understanding - extracting semantic knowledge from the image;
6. Optimization - improving image characteristics.
The emergence of deep learning and convolutional neural networks (CNN) has led to significant advances in many of these tasks (LeCun et al., 2015). CNNs are computational models composed of multiple processing layers, where each layer learns a different level of representation of the data. An image is represented by a three-dimensional array, where the first two dimensions describe the 2D distribution of pixels and the third dimension contains the color information of each pixel. The CNN learns features from the images through an intensive training process. The first layer typically represents the presence or absence of edges at particular orientations and locations. The second layer detects arrangements of edges, and the third layer assembles these into larger combinations that correspond to parts of familiar objects. Subsequent layers detect objects as combinations of these parts. The key aspect is that these layers of features are learned from data using a general-purpose learning procedure, instead of being hand-engineered.
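To make the first-layer intuition concrete, the following sketch (ours, for illustration only, and not part of any deep learning framework) correlates one image channel with a hand-written 3x3 edge kernel; in a real CNN, such kernel weights are learned from data rather than specified by an engineer.

```swift
import Foundation

// Slide a small kernel over a single image channel (e.g. one color
// plane of the three-dimensional array described above) and compute
// the weighted sum at each position ("valid" region only).
func convolve(_ channel: [[Double]], with kernel: [[Double]]) -> [[Double]] {
    let kh = kernel.count, kw = kernel[0].count
    let h = channel.count - kh + 1, w = channel[0].count - kw + 1
    var output = Array(repeating: Array(repeating: 0.0, count: w), count: h)
    for y in 0..<h {
        for x in 0..<w {
            var sum = 0.0
            for ky in 0..<kh {
                for kx in 0..<kw {
                    sum += channel[y + ky][x + kx] * kernel[ky][kx]
                }
            }
            output[y][x] = sum
        }
    }
    return output
}

// A Sobel-like kernel that responds strongly to vertical edges; a
// trained first layer typically ends up with many filters of this flavor.
let verticalEdgeKernel: [[Double]] = [[-1, 0, 1],
                                      [-2, 0, 2],
                                      [-1, 0, 1]]
```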
In this paper, we are mainly interested in object
detection and recognition (point 4 in the above list).
Object recognition in images is a complex task, not only because of the intrinsic complexity of the process but also because of the semantic information images convey (Uijlings et al., 2013).
Humans are able to recognize with ease the location of objects, their type, and even their function. To replicate this process in a computer, the algorithm must be able to mark a region in the image (segmenting the object) and perform a classification (identifying the type of the object).
There has been huge progress in this field over the last few years. One popular approach to both locating and recognizing objects in a single operation is YOLO - You Only Look Once (Redmon et al., 2016). This approach uses a single CNN to simultaneously predict multiple bounding boxes and class probabilities for those boxes, framing detection as a regression problem. This makes the algorithm very fast and able to run on a portable device, such as a tablet or smartphone.
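On iOS, this kind of detector is typically executed through Apple's Vision and Core ML frameworks. The sketch below is a minimal illustration under that assumption; YOLOv3 stands for a hypothetical pre-compiled Core ML model class, and the rest uses standard Vision calls that wrap each detection as an observation carrying a bounding box and ranked class labels.

```swift
import Vision
import CoreML
import CoreGraphics

// Minimal sketch: run a YOLO-style Core ML model on one frame.
// `YOLOv3` is an assumed, auto-generated model class name.
func detectObjects(in image: CGImage) throws -> [(label: String, confidence: Float, box: CGRect)] {
    let yolo = try YOLOv3(configuration: MLModelConfiguration())
    let model = try VNCoreMLModel(for: yolo.model)
    let request = VNCoreMLRequest(model: model)
    // Let Vision scale the camera frame to the network's input size.
    request.imageCropAndScaleOption = .scaleFill

    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])

    let observations = request.results as? [VNRecognizedObjectObservation] ?? []
    return observations.compactMap { obs in
        guard let top = obs.labels.first else { return nil }
        // boundingBox is normalized (0...1) with a bottom-left origin.
        return (top.identifier, top.confidence, obs.boundingBox)
    }
}
```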
3 METHODOLOGY
The purpose of this work is to develop a mobile application, which we called SeMI (an acronym derived from "Interactive Musical Setting"), targeting iOS devices, that can detect and recognize objects in images
collected with the camera. Each time an object is recognized, a sound texture is played. If multiple objects are recognized, their associated textures are mixed into the output sound.
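One plausible way to realize this mixing on iOS, sketched below as our own illustration rather than SeMI's code, is AVFoundation's audio engine: each recognized object class drives its own player node, and all nodes feed the engine's main mixer.

```swift
import AVFoundation

// One player node per sounding texture, all mixed by the engine.
final class TextureMixer {
    private let engine = AVAudioEngine()
    private var players: [String: AVAudioPlayerNode] = [:]

    func start() throws {
        _ = engine.mainMixerNode   // ensure an output chain exists
        engine.prepare()
        try engine.start()
    }

    // Start the texture associated with a newly recognized object.
    func play(textureAt url: URL, forObjectClass objectClass: String) throws {
        guard players[objectClass] == nil else { return }  // already sounding
        let file = try AVAudioFile(forReading: url)
        let player = AVAudioPlayerNode()
        engine.attach(player)
        engine.connect(player, to: engine.mainMixerNode, format: file.processingFormat)
        player.scheduleFile(file, at: nil)
        player.play()
        players[objectClass] = player
    }

    // Silence the texture when its object leaves the frame.
    func stop(objectClass: String) {
        guard let player = players.removeValue(forKey: objectClass) else { return }
        player.stop()
        engine.detach(player)
    }
}
```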
Objects are specified by the user in a virtual setting, a representation on the screen, where the user can drag, drop, resize, and move several objects. The user can also associate a sound file (WAV, MP3, AAC, and other formats) with each object.
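For illustration, the virtual setting can be modeled as a list of placed objects, each tying a detector class to an on-screen frame and a sound file; the type and property names below are hypothetical, not taken from SeMI's source.

```swift
import CoreGraphics
import Foundation

// Hypothetical data model for the virtual setting.
struct PlacedObject {
    var objectClass: String   // class name the detector can report
    var frame: CGRect         // position and size after drag/resize
    var soundURL: URL         // WAV, MP3, AAC, ...
}

struct VirtualSetting {
    var objects: [PlacedObject] = []

    // The texture to trigger when the detector reports a class.
    func soundURL(forObjectClass objectClass: String) -> URL? {
        objects.first { $0.objectClass == objectClass }?.soundURL
    }
}
```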
Based on this, a software application was designed, composed of four main modules (Figure 1). Images are captured with the camera, and each image is rotated and scaled according to the orientation of the device and the input requirements of the object recognition module. The YOLO module continuously analyzes the images and outputs a list of vectors, each containing the bounding box coordinates, the class of the object, and the associated probability of the classification.
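The rotation step can, for instance, be expressed by handing the device orientation to Vision, which also performs the scaling to the network's input size. The mapping below follows a common convention for back-camera frames and is our assumption, not SeMI's exact code, as is the Detection type mirroring the output vectors just described.

```swift
import Vision
import ImageIO
import UIKit

// Sketch: translate device orientation into the image orientation
// Vision expects, so camera frames are analyzed upright.
func imageOrientation(for device: UIDeviceOrientation) -> CGImagePropertyOrientation {
    switch device {
    case .portrait:           return .right
    case .portraitUpsideDown: return .left
    case .landscapeLeft:      return .up
    case .landscapeRight:     return .down
    default:                  return .right
    }
}

// One entry of the YOLO module's output list.
struct Detection {
    var boundingBox: CGRect   // normalized image coordinates
    var objectClass: String
    var probability: Float
}
```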