have used 60% of the instances to train the SVM and the other 40% to test it. To obtain statistically significant results, we have performed a 10-fold cross-validation. We have found that this system is able to recognize the following events with an overall accuracy of 73%: someone falling down, slice, screaming, rain, printer, people talking, frying food, filling water, door knocking, dog bark, car horn, glass breaking, baby crying, and water boiling.
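As a minimal sketch of this evaluation protocol, the following code assumes scikit-learn and precomputed MFCC feature vectors; the data shapes, class count and SVM kernel are illustrative placeholders, since the toolkit is not specified here.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, cross_val_score

rng = np.random.RandomState(0)
X = rng.randn(700, 13)        # placeholder: one 13-dim MFCC vector per sample
y = rng.randint(0, 14, 700)   # placeholder: 14 acoustic event classes

# 60% of the instances train the SVM, the other 40% test it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print("hold-out accuracy:", clf.score(X_test, y_test))

# 10-fold cross-validation for statistically more reliable figures.
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=10)
print("10-fold CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))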
Figure 7 shows the confusion matrix.
Figure 7: Confusion Matrix. Events are ordered from
left to right as follows: falling down, slice, screaming,
rain, printer, people talking, frying food, filling water, door
knocking, dog bark, car horn, glass breaking, baby crying,
water boiling.
This confusion matrix shows how often the SVM misclassifies a given class and thus assigns the wrong event to an audio sample. In general, the best result for each class is obtained when a sound event is tested against itself, which reflects the classifier's ability to distinguish each audio event from the others. Ideally, the confusion matrix would be an identity-like matrix with the value 100 along its diagonal.
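A row-normalized matrix of this kind can be produced as in the sketch below, again assuming scikit-learn; the predictions are synthetic placeholders drawn at roughly the reported overall accuracy, only to show the computation.

import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.RandomState(1)
y_true = rng.randint(0, 14, 280)    # placeholder ground truth
correct = rng.rand(280) < 0.73      # placeholder: ~73% correct, as reported
y_pred = np.where(correct, y_true, rng.randint(0, 14, 280))

cm = confusion_matrix(y_true, y_pred, labels=list(range(14)))
# Express each row in percent, so an ideal classifier shows 100 on the diagonal.
cm_pct = 100.0 * cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)
print(np.round(cm_pct, 1))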
Although the classifier performs reasonably well, it confuses some sound events whose MFCC coefficients are very similar. For instance, on row 6 of Figure 7, door knocking, people talking and frying food have similar MFCC vector patterns, and thus the SVM yields a low accuracy in these specific situations. To address this concern, we plan to (1) complement the training vector set with features from other sources in addition to MFCCs, and (2) use a more sophisticated classifier, such as a deep neural network.
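The kind of MFCC overlap described here can be checked with a quick comparison of averaged MFCC vectors, as in the sketch below; librosa is an assumed choice of extractor and the file names are hypothetical.

import numpy as np
import librosa

def mean_mfcc(path, n_mfcc=13):
    # Summarize a clip by the mean of its MFCC frames.
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

a = mean_mfcc("door_knocking.wav")  # hypothetical file
b = mean_mfcc("frying_food.wav")    # hypothetical file

# A cosine similarity close to 1 means the averaged MFCC vectors are
# hard to separate, which matches the confusions seen in Figure 7.
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("cosine similarity:", cos)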
6 CONCLUSIONS
The preliminary results of this work encourage us to continue analyzing the events that happen in the house. We will improve the feature extraction stage with other methods and test further machine learning algorithms to increase the accuracy of the system using a single acoustic measurement. The next step after this proof of concept on the Jetson TK1 is to expand the platform into a wider sensor network, in which several autonomous acoustic sensors send their data to the GPU for processing. At that stage, an important part of the work will focus on optimizing the acoustic event detection algorithm to take advantage of the parallelization offered by the GPU.
ACKNOWLEDGEMENTS
The authors would like to thank the Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement (Generalitat de Catalunya) for its support under grant ref. 2014-SGR-0590.