the need to provide raw content: the only information that leaves the confines of the integrated unit is an abstract ADL log. Although such information still needs to be managed in full accordance with guidelines pertaining to private data, the level of obtrusiveness is greatly reduced by the assurance that no unwarranted analysis or recording can conceivably be done.
Our system is designed to satisfy two key requirements: that the analysis algorithms are computationally efficient enough to run on the Raspberry Pi device; and that they can be tuned and configured for different acoustic environments by technicians without machine learning expertise. In order to evaluate the proposed approach against these requirements, we motivate and present a use case based on bathroom usage (Section 3) and draw conclusions (Section 4).
2 PROPOSED METHOD
2.1 Overall Architecture
The core of the system is a microphone-equipped Raspberry Pi single-board PC that is used for all data acquisition and processing. Its small form factor, low energy consumption and low overall cost make it ideal for installation in any room/area we want to monitor, and its processing power is sufficient for running our algorithms in real time. In our experiments we used a Raspberry Pi Model B with a Wolfson audio card.
Communication to/from the device is performed using the MQTT machine-to-machine communication protocol. MQTT is a lightweight messaging protocol that implements the brokered publish/subscribe pattern and is widely used in IoT applications. Without going into technical details, the main idea is that, when connected to a specified MQTT 'broker', various machines/applications can send messages under a certain topic and others can listen to these messages when 'subscribed' to that topic. In our use case, MQTT is used both for sending commands to the Raspberry Pi (for example to start/stop recording) and for remotely receiving the processing results.
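As a minimal sketch of this pattern (assuming the paho-mqtt Python client and hypothetical broker address and topic names, none of which are specified here), the device-side client could look as follows:

```python
import paho.mqtt.client as mqtt

BROKER = "broker.example.org"   # hypothetical broker address
CMD_TOPIC = "adl/command"       # hypothetical command topic
RESULT_TOPIC = "adl/results"    # hypothetical results topic

def on_connect(client, userdata, flags, rc):
    # Subscribe to the command topic once connected to the broker
    client.subscribe(CMD_TOPIC)

def on_message(client, userdata, msg):
    # React to remote commands, e.g. start/stop recording
    command = msg.payload.decode()
    print("Received command:", command)
    # ... trigger recording / training / classification here ...
    client.publish(RESULT_TOPIC, "ack: " + command)

client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect(BROKER, 1883)
client.loop_forever()  # blocking network loop
```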
For this purpose, two MQTT clients were implemented. The first is installed on the Raspberry Pi and is subscribed to a 'command' topic in order to receive requests for collecting training data, building audio class models and, finally, using them for real-time classification. The second one is bundled in an Android application and is used for remotely sending the corresponding commands and listening to the classification results. The system is designed with ease of use in mind, and the only set-up needed is connecting the two clients to the same broker. By having a dedicated broker this step can be performed automatically, making the whole system 'plug-and-play'.
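The controller side reduces to the mirror image of the device-side client: publish on the command topic and subscribe to the results topic. A Python sketch, again with hypothetical topic names and command payloads, illustrates the exchange:

```python
import paho.mqtt.client as mqtt

BROKER = "broker.example.org"   # hypothetical broker address

def on_message(client, userdata, msg):
    # Print classification results published by the Raspberry Pi
    print("Result:", msg.payload.decode())

client = mqtt.Client()
client.on_message = on_message
client.connect(BROKER, 1883)
client.subscribe("adl/results")          # hypothetical results topic
client.publish("adl/command", "record")  # hypothetical command payload
client.loop_forever()
```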
2.2 System Calibration
Once set up, the system has to go through a training phase before it can be used in real-life scenarios. This includes recording, feature extraction, manual annotation of the recorded events and classifier tuning / training. Figure 1 shows the proposed calibration procedure. During this phase, the various events are recorded using the Android application as a remote controller of the Raspberry Pi device, which performs the actual recording and further processing. An audio file is created on the user's demand and the user/technician is informed about the categories and durations of already recorded data. The technician then provides the current recording's label (e.g. 'door bell'). When a reasonably large amount of data has been gathered (typically about 1-2 minutes of recordings for each category), the technician uses the mobile application to trigger the training process (which is also executed on the Raspberry Pi device). After this process, the Raspberry Pi is ready to monitor and recognize sounds in the 'learned' environment.
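The paper does not prescribe a specific toolkit for this step; as one possible realization, a segment classifier could be trained from directories of labelled recordings using the open-source pyAudioAnalysis library (the directory names and category labels below are hypothetical, chosen to match the bathroom use case):

```python
from pyAudioAnalysis import audioTrainTest as aT

# One directory of WAV recordings per event category, e.g.:
#   data/flush/*.wav, data/shower/*.wav, data/sink/*.wav
class_dirs = ["data/flush", "data/shower", "data/sink"]

# Train an SVM on mid-term feature statistics (1 s mid-term windows,
# 50 ms short-term frames) and store the model for later classification
aT.extract_features_and_train(
    class_dirs,
    1.0, 1.0,      # mid-term window / step (seconds)
    0.050, 0.050,  # short-term window / step (seconds)
    "svm", "bathroom_model")
```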
2.3 Audio Event Recognition
2.3.1 Audio Features
In total, 34 audio features are extracted on a short-term basis. This process results in a sequence of 34-dimensional short-term feature vectors. In addition, the feature sequence is processed on a mid-term basis: the audio signal is first divided into mid-term windows (segments), the short-term processing stage is carried out within each segment, and the resulting feature sequence of each mid-term segment is used for computing feature statistics (e.g. the average value of the ZCR). Therefore, each mid-term segment is represented by a set of statistics. In this section we provide a brief description of the adopted audio features. For a detailed description the reader can refer to the related bibliography (Giannakopoulos and Pikrakis, 2014), (Theodoridis and Koutroumbas, 2008), (Hyoung-Gook et al., 2005). The time-domain features (features 1-3) are directly extracted from the raw signal samples, while the frequency-domain features (features 4-34, apart from the MFCCs) are based on the magnitude of the Discrete Fourier Transform (DFT). The cepstral domain (e.g. used by the MFCCs) results after applying the Inverse DFT on the logarithm of the spectral magnitude.
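To make the short-term / mid-term scheme concrete, the following self-contained sketch (our illustration, not the paper's implementation) computes one time-domain feature (ZCR) and one frequency-domain feature (spectral centroid) on short-term frames, then summarizes them as mid-term statistics:

```python
import numpy as np

def short_term_features(signal, fs, win=0.050, step=0.050):
    """Frame the signal and compute ZCR and spectral centroid per frame."""
    n = int(win * fs)
    hop = int(step * fs)
    feats = []
    for start in range(0, len(signal) - n + 1, hop):
        frame = signal[start:start + n]
        # Zero-crossing rate (time domain)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        # Spectral centroid from the DFT magnitude (frequency domain)
        mag = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(n, 1.0 / fs)
        centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-10)
        feats.append((zcr, centroid))
    return np.array(feats)  # shape: (num_frames, 2)

def mid_term_statistics(signal, fs, mid_win=1.0):
    """Mean/std of the short-term features over each mid-term segment."""
    seg_len = int(mid_win * fs)
    stats = []
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        f = short_term_features(signal[start:start + seg_len], fs)
        stats.append(np.hstack([f.mean(axis=0), f.std(axis=0)]))
    return np.array(stats)  # one statistics vector per mid-term segment
```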