erator that counts occurrences of discrete values
(hotel names, in our example). Temporal aggre-
gators have to be followed by the fetch operator,
which determines the points in time where aggre-
gate results are actually produced (in the example,
every 3 months; the computation might of course
be done continuously). Thus, two or more aggre-
gators can be synchronized. The purpose of ag-
gregators is to lower the criticality as well as the
volume of the generated data.
Filters. A filter’s purpose is to reduce the data set by
removing attribute values which fulfill a certain
condition.
Perturbators. Perturbators distort attribute values
by, e.g., adding random noise. They reduce the
criticality by rendering values unusable with
respect to a specific device, while still allowing
aggregate values to be computed over many devices.
Interpreters. An interpreter enriches data semantically
by using publicly available information. This
step could always be performed at the consumer’s
site as well. However, the resulting data
might offer a better trade-off between utility and
privacy, and interpretation should thus be part of
the standardized processing stage. In our example, the
time-location pair is highly critical and would thus
have to be filtered or perturbed, which renders it
unusable for hotel rating. However, if we use an
interpreter to transform the numeric location into a
semantic location (e.g., via a geocoder that supports
reverse geocoding), filter out everything except
hotels, and use a counting aggregator to record visit
counts for every hotel, we can preserve data utility
while drastically reducing its criticality.
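The hotel example can be sketched as a small operator chain. The following is a minimal illustration, not the system's actual implementation. The lookup table `PLACES` stands in for a reverse geocoder, the hotel names and coordinates are made up, the three-month fetch period is approximated as 90 days, and we assume that a fetch resets the aggregator state.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical stand-in for a reverse geocoder. Keys are (lat, lon) in
# rounded millidegrees; values are (category, name). All entries are made up.
PLACES = {
    (48137, 11575): ("hotel", "Hotel Alpha"),
    (48140, 11560): ("hotel", "Hotel Beta"),
    (48150, 11580): ("restaurant", "Cafe Gamma"),
}


def interpret(lat, lon):
    """Interpreter: map a numeric location to a semantic one (or None)."""
    return PLACES.get((round(lat * 1000), round(lon * 1000)))


def is_hotel(place):
    """Filter predicate: only hotel visits pass."""
    return place is not None and place[0] == "hotel"


class CountingAggregator:
    """Counts occurrences of discrete values."""

    def __init__(self):
        self._counts = Counter()

    def add(self, value):
        self._counts[value] += 1

    def fetch(self):
        # Assumption: fetching emits the current counts and resets them.
        result = dict(self._counts)
        self._counts.clear()
        return result


def run_pipeline(samples, fetch_interval):
    """Process time-sorted (timestamp, lat, lon) samples.

    Returns a list of (fetch_time, counts) pairs. Only these counts would
    reach the transmit node; raw time-location pairs stay on the device.
    """
    agg = CountingAggregator()
    emitted = []
    next_fetch = None
    for ts, lat, lon in samples:
        if next_fetch is None:
            next_fetch = ts + fetch_interval
        while ts >= next_fetch:
            emitted.append((next_fetch, agg.fetch()))
            next_fetch += fetch_interval
        place = interpret(lat, lon)
        if is_hotel(place):
            agg.add(place[1])
    if next_fetch is not None:
        emitted.append((next_fetch, agg.fetch()))  # flush the last window
    return emitted


samples = [
    (datetime(2013, 1, 1), 48.1372, 11.5751),  # Hotel Alpha
    (datetime(2013, 1, 5), 48.1502, 11.5803),  # Cafe Gamma, filtered out
    (datetime(2013, 2, 1), 48.1371, 11.5749),  # Hotel Alpha again
    (datetime(2013, 5, 1), 48.1401, 11.5602),  # Hotel Beta, second window
]
out = run_pipeline(samples, timedelta(days=90))
```

In this run, the first fetch window contains two visits to the first hotel and the second window one visit to the other, while the restaurant visit never leaves the device.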
This list is certainly not complete. Besides these oper-
ators, we could also think of mathematical (stateless)
functions, which might also influence the criticality.
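The effect of a perturbator on individual versus aggregate values can be illustrated with a short simulation. This is a sketch under assumptions of our own: the noise is zero-mean Gaussian, the noise level `sigma` and the synthetic per-device values are chosen purely for illustration, and the mean serves as the aggregate.

```python
import random
import statistics


def perturb(value, sigma, rng):
    """Perturbator: add zero-mean Gaussian noise to a single value."""
    return value + rng.gauss(0.0, sigma)


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# One synthetic attribute value per device (illustrative data only).
true_values = [20.0 + 0.001 * i for i in range(10_000)]
noisy_values = [perturb(v, sigma=5.0, rng=rng) for v in true_values]

# Error for a single device: large relative to the spread of the data.
per_device_error = statistics.fmean(
    abs(n - t) for n, t in zip(noisy_values, true_values)
)

# Error of the aggregate: the noise largely averages out over many devices.
aggregate_error = abs(
    statistics.fmean(noisy_values) - statistics.fmean(true_values)
)
```

With independent zero-mean noise, the error of the mean shrinks roughly with the square root of the number of devices, which is why the aggregate stays useful even though each single value is distorted.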
Everything that goes into the transmit node in
Figure 2 is sent to interested consumers, together
with a timestamp. An ID for the data stream is
needed when individual patterns have to be monitored.
However, in adherence to our modified stream
semantics principle, the ID is generated randomly and
is not a constant value. In fact, the change frequency
of an ID (in our example: one year) has to be considered
during criticality assessment. We must, of
course, assume that there is a data transport mechanism
that does not allow linking two stream chunks
with different IDs to the same device. This, however,
is our only assumption regarding the external
infrastructure’s trustworthiness.
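A randomly generated stream ID that changes after a fixed lifetime could look as follows. This is a minimal sketch. The class name `StreamIdProvider`, the record layout, and the 128-bit ID length are our assumptions, and one year is approximated as 365 days.

```python
import secrets
from datetime import datetime, timedelta


class StreamIdProvider:
    """Hands out a random stream ID and replaces it once it has expired."""

    def __init__(self, lifetime):
        self._lifetime = lifetime
        self._current = None
        self._expires = None

    def id_for(self, now):
        if self._current is None or now >= self._expires:
            # Fresh random ID, not derived from any device property.
            self._current = secrets.token_hex(16)
            self._expires = now + self._lifetime
        return self._current


def transmit_record(provider, payload, now):
    """Assemble what is handed to the transmit node (hypothetical layout)."""
    return {
        "stream_id": provider.id_for(now),
        "timestamp": now.isoformat(),
        "payload": payload,
    }


provider = StreamIdProvider(lifetime=timedelta(days=365))
t0 = datetime(2013, 1, 1)
first = transmit_record(provider, {"Hotel Alpha": 2}, t0)
same_year = transmit_record(provider, {"Hotel Alpha": 1}, t0 + timedelta(days=100))
next_year = transmit_record(provider, {"Hotel Beta": 3}, t0 + timedelta(days=366))
```

Records sent within the ID lifetime share a stream ID and can be linked by the consumer. After rotation, linking depends entirely on whether the transport mechanism leaks the device identity, which is the assumption stated above.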
3.3 Architecture Sketch
Collective Apps, as well as general consumers, issue
queries as described in the previous section. As
depicted in Figure 3, queries are subject to privacy
checks before they are actually executed. Queries make
use of data sources and operators. Although it is not
within the scope of this paper, we see energy-efficient
sensor management, e.g., for location sensing (Zhuang
et al., 2010), as a crucial ingredient here. Some
operators or data sources might access publicly available
knowledge, like the geocoder from our example. Both the
generated data and requests for supplementary
data are subject to energy-efficient lazy transmission
scheduling, as is done in (Ra et al., 2010).
With privacy issues kept inside the device, the
off-device infrastructure’s main challenge is to ensure
fairness between all data consumers. It has to
intelligently distribute queries to devices all over the
planet, keep track of them, and transport the
generated data back to the consumers. Additionally,
similar queries should be combined to reduce the mobile
data transfer volume and thus the individual energy
consumption, allowing every consumer to query
a larger set of devices. Due to the large number of
devices, consumers, and queries, the common infrastructure
itself has to be a distributed system, which
brings its own challenges.
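One simple way to combine queries is to group those that request the same data source and operator chain, so that a device evaluates and transmits each distinct chain only once. The sketch below narrows "similar" to "identical source and operators", which is our simplification. The `Query` type and its fields are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    consumer: str     # who asked (hypothetical field)
    source: str       # data source, e.g. "location"
    operators: tuple  # operator chain, e.g. ("geocode", "filter:hotel", "count")


def combine(queries):
    """Group queries by (source, operators).

    Returns a dict mapping each distinct chain to the consumers that
    should receive its results.
    """
    groups = defaultdict(list)
    for q in queries:
        groups[(q.source, q.operators)].append(q.consumer)
    return dict(groups)


hotel_chain = ("geocode", "filter:hotel", "count")
queries = [
    Query("rating-app", "location", hotel_chain),
    Query("tourism-board", "location", hotel_chain),
    Query("noise-map", "microphone", ("average",)),
]
combined = combine(queries)
```

Here three queries collapse into two device-side executions. Combining chains that only partly overlap, for example queries with different fetch periods, would need more elaborate query rewriting than this sketch covers.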
4 RESEARCH DIRECTIONS
To conclude, we present the most crucial research
challenges in the pursuit of this vision.
Power models for mobile devices are necessary to
assign energy costs to queries (transparency and control).
Although much work has been done to model
power consumption of hardware components (Zhang
et al., 2010), the cross-layer nature of data collection
still poses a challenge.
Aggregation and Perturbation are well-known
concepts in the data mining community and are
amenable to distributed execution (Agrawal and
Haritsa, 2005). However, most research focuses
globally on static databases instead of locally on the
data generation site.
Although there is work on privacy quantification
(Venkatasubramanian, 2008; Agrawal and Aggarwal,
2001) for selected privacy-preserving data mining
techniques, a comprehensive approach that allows
chains of operators to be quantified does not yet exist.
Furthermore, the resulting measures often depend on the
concrete data. Another approach would be to come up
with a model for data- and operator-specific privacy
PECCS 2013 - International Conference on Pervasive and Embedded Computing and Communication Systems