device such as an iPhone or iPad, that performs image analysis and reproduces student-defined musical textures or planes. This tool was used both with secondary school students, in a science summer school, and with higher education students, in the master's in Teaching of Musical Education in the Basic School. It is through experimentation and careful reflection that students learn the principles underlying sonification and can explore art installations, interactive performances, and other formats.
2 MACHINE LEARNING
Digital image processing has been a challenge for several decades. Nevertheless, the importance of this field has pushed research towards significant advances throughout the years. Both technological advances and the adoption of new paradigms have made image processing faster, more precise, and more useful. Several problems are usually addressed through digital image processing (Egmont-Petersen et al., 2002):
1. Preprocessing/filtering - the output is of the same type as the input, with the objective of improving contrast or reducing noise;
2. Data reduction/feature extraction - the resulting data are usually smaller and indicative of significant components in the image;
3. Segmentation - partitioning and selecting regions in the image according to some criterion;
4. Object detection and recognition - determining the location, orientation, and classification of specific objects in the image;
5. Image understanding - extracting semantic knowledge from the image;
6. Optimization - improving image characteristics.
The emergence of deep learning and convolutional neural networks (CNN) has led to significant advances in many of these tasks (LeCun et al., 2015). CNNs are computational models composed of multiple processing layers, where each layer learns a different level of representation of the data. An image is represented by a three-dimensional array, where the first two dimensions describe the 2D distribution of pixels and the third dimension contains the color information of each pixel. The CNN learns features from the images through an intensive training process. The first layer typically represents the presence or absence of edges at particular orientations and locations. The second layer detects arrangements of edges, and the third layer assembles these into larger combinations that correspond to parts of familiar objects. Subsequent layers detect objects as combinations of these parts. The key aspect is that these layers of features are learned from data using a general-purpose learning procedure, instead of being hand-engineered.
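To make the first-layer intuition concrete, the following sketch (ours, for illustration only, and not part of any deep learning framework) correlates one image channel with a hand-written 3x3 edge kernel; in a real CNN, such kernel weights are learned from data rather than specified by an engineer.

```swift
import Foundation

// Slide a small kernel over a single image channel (e.g. one color
// plane of the three-dimensional array described above) and compute
// the weighted sum at each position ("valid" region only).
func convolve(_ channel: [[Double]], with kernel: [[Double]]) -> [[Double]] {
    let kh = kernel.count, kw = kernel[0].count
    let h = channel.count - kh + 1, w = channel[0].count - kw + 1
    var output = Array(repeating: Array(repeating: 0.0, count: w), count: h)
    for y in 0..<h {
        for x in 0..<w {
            var sum = 0.0
            for ky in 0..<kh {
                for kx in 0..<kw {
                    sum += channel[y + ky][x + kx] * kernel[ky][kx]
                }
            }
            output[y][x] = sum
        }
    }
    return output
}

// A Sobel-like kernel that responds strongly to vertical edges; a
// trained first layer typically ends up with many filters of this flavor.
let verticalEdgeKernel: [[Double]] = [[-1, 0, 1],
                                      [-2, 0, 2],
                                      [-1, 0, 1]]
```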
In this paper, we are mainly interested in object
detection and recognition (point 4 in the above list).
Object recognition in images is a complex task, not only because of the intrinsic complexity of the process but also because of the semantic information images convey (Uijlings et al., 2013).
Humans are able to recognize with ease the location of objects, their type, and even their function. To replicate this process in a computer, the algorithm must be able to mark a region in the image (segmenting the object) and perform a classification (identifying the type of the object).
There has been huge progress in this field over the last few years. One popular approach to both locating and recognizing objects in a single operation is YOLO - You Only Look Once (Redmon et al., 2016). This approach uses a single CNN to simultaneously predict multiple bounding boxes and class probabilities for those boxes, framing detection as a regression problem. This makes the algorithm very fast and able to run on a portable device, such as a tablet or smartphone.
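On iOS, this kind of detector is typically executed through Apple's Vision and Core ML frameworks. The sketch below is a minimal illustration under that assumption; YOLOv3 stands for a hypothetical pre-compiled Core ML model class, and the rest uses standard Vision calls that wrap each detection as an observation carrying a bounding box and ranked class labels.

```swift
import Vision
import CoreML
import CoreGraphics

// Minimal sketch: run a YOLO-style Core ML model on one frame.
// `YOLOv3` is an assumed, auto-generated model class name.
func detectObjects(in image: CGImage) throws -> [(label: String, confidence: Float, box: CGRect)] {
    let yolo = try YOLOv3(configuration: MLModelConfiguration())
    let model = try VNCoreMLModel(for: yolo.model)
    let request = VNCoreMLRequest(model: model)
    // Let Vision scale the camera frame to the network's input size.
    request.imageCropAndScaleOption = .scaleFill

    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])

    let observations = request.results as? [VNRecognizedObjectObservation] ?? []
    return observations.compactMap { obs in
        guard let top = obs.labels.first else { return nil }
        // boundingBox is normalized (0...1) with a bottom-left origin.
        return (top.identifier, top.confidence, obs.boundingBox)
    }
}
```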
3 METHODOLOGY
The purpose of this work is to develop a mobile application, which we called SeMI (an acronym derived from "Interactive Musical Setting"), targeting iOS devices, that can detect and recognize objects in images
collected with the camera. Each time an object is recognized, a sound texture is played. If multiple objects are recognized, their associated textures are mixed into the output sound.
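One plausible way to realize this mixing on iOS, sketched below as our own illustration rather than SeMI's code, is AVFoundation's audio engine: each recognized object class drives its own player node, and all nodes feed the engine's main mixer.

```swift
import AVFoundation

// One player node per sounding texture, all mixed by the engine.
final class TextureMixer {
    private let engine = AVAudioEngine()
    private var players: [String: AVAudioPlayerNode] = [:]

    func start() throws {
        _ = engine.mainMixerNode   // ensure an output chain exists
        engine.prepare()
        try engine.start()
    }

    // Start the texture associated with a newly recognized object.
    func play(textureAt url: URL, forObjectClass objectClass: String) throws {
        guard players[objectClass] == nil else { return }  // already sounding
        let file = try AVAudioFile(forReading: url)
        let player = AVAudioPlayerNode()
        engine.attach(player)
        engine.connect(player, to: engine.mainMixerNode, format: file.processingFormat)
        player.scheduleFile(file, at: nil)
        player.play()
        players[objectClass] = player
    }

    // Silence the texture when its object leaves the frame.
    func stop(objectClass: String) {
        guard let player = players.removeValue(forKey: objectClass) else { return }
        player.stop()
        engine.detach(player)
    }
}
```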
Objects are specified by the user in a virtual setting, a representation on the screen, where the user can drag, drop, resize, and move several objects. The user can also associate a sound file (WAV, MP3, AAC, and other formats) with each object.
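For illustration, the virtual setting can be modeled as a list of placed objects, each tying a detector class to an on-screen frame and a sound file; the type and property names below are hypothetical, not taken from SeMI's source.

```swift
import CoreGraphics
import Foundation

// Hypothetical data model for the virtual setting.
struct PlacedObject {
    var objectClass: String   // class name the detector can report
    var frame: CGRect         // position and size after drag/resize
    var soundURL: URL         // WAV, MP3, AAC, ...
}

struct VirtualSetting {
    var objects: [PlacedObject] = []

    // The texture to trigger when the detector reports a class.
    func soundURL(forObjectClass objectClass: String) -> URL? {
        objects.first { $0.objectClass == objectClass }?.soundURL
    }
}
```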
Based on this, a software application was designed, composed of four main modules (Figure 1). Images are captured with the camera, and each image is rotated and scaled according to the orientation of the device and the input requirements of the object recognition module. The YOLO module continuously analyzes the images and outputs a list of vectors, each containing the bounding box coordinates, the class of the object, and the associated probability of the classification.
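The rotation step can, for instance, be expressed by handing the device orientation to Vision, which also performs the scaling to the network's input size. The mapping below follows a common convention for back-camera frames and is our assumption, not SeMI's exact code, as is the Detection type mirroring the output vectors just described.

```swift
import Vision
import ImageIO
import UIKit

// Sketch: translate device orientation into the image orientation
// Vision expects, so camera frames are analyzed upright.
func imageOrientation(for device: UIDeviceOrientation) -> CGImagePropertyOrientation {
    switch device {
    case .portrait:           return .right
    case .portraitUpsideDown: return .left
    case .landscapeLeft:      return .up
    case .landscapeRight:     return .down
    default:                  return .right
    }
}

// One entry of the YOLO module's output list.
struct Detection {
    var boundingBox: CGRect   // normalized image coordinates
    var objectClass: String
    var probability: Float
}
```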