System Support for Privacy-preserving and Energy-efﬁcient Data

Gathering in the Global Smartphone Network

Opportunities and Challenges

Jochen Streicher, Orwa Nassour and Olaf Spinczyk

Technische Universit¨at Dortmund, Dortmund, Germany

Keywords:

Privacy, Energy Management, Execution Platform, Collective Apps, Internet Queries.

Abstract:

Smartphones are becoming the predominant mobile computing device of the decade. By end of 2012 more

than 150 million devices have already been sold and the yearly market growth is about 40%. The ubiquity of

the smartphone creates tremendous opportunities for collecting and mining data from end users. Smartphones

are becoming a global sensor network that could answer questions about user (customer) behavior and their

interests, device status, security threats, and a vast amount of derived information such as CO

2 footprints,

trafﬁc conditions, etc. However, so far only very few market-leading companies, such as Apple and Google,

are able to exploit this data source. Even though technologically possible, end user concerns such as privacy

protection, energy consumption, and the general lack of incentives, make it difﬁcult for smaller companies

and private app developers to make use of the smartphone network. This paper will present the vision of

an open system support platform for running ﬂexible “Internet Queries” and “Collective Apps” in the global

smartphone network. We analyze the problems of the current state of the art, derive platform requirements,

and sketch the envisioned platform’s architecture. The discussion will culminate in a list of important research

directions to be followed.

1 VISION

The proliferation of smartphones and similar devices

in home and ofﬁce environments is a big step to-

wards Mark Weiser’s vision of Ubiquitous Comput-

ing (Weiser, 1991). They provide sensors, actuators,

and decent computational power in everyday situa-

tions and share the Internet as a global communica-

tion infrastructure. It would be a waste of resources

if we used these devices only for individual end users

instead of exploiting the world’s biggest “sensor net-

work” to answer questions of global scale, which we

have never been able to answer before. Figure 1 il-

lustrates the gain of information by mining sensor

data and locally derived data from the vast number

of smartphones in the Internet. As applications of this

data are so manifold, the ﬁgure contains only a few

examples and does intentionally not strive for com-

pleteness.

Our discussion of the state of the art in Section

2 will show that a (small) number of market lead-

ing companies are already using this tremendous data

source. Several research projects are also interested,

but have only very limited access. Smaller companies

Figure 1: Examples for sensor data taken on smartphones,

locally derived data, and information that would be avail-

able if the data was collected from the whole smartphone

network (or large fractions).

and private app developers have virtually no chance

to access the smartphone network. It is interesting to

note here that the problem is–at ﬁrst glance–not tech-

nical. Apps downloaded from a market can techni-

cally access a smartphone’s sensors and transmit the

data via the Internet to a central server. It is more a

Streicher J., Nassour O. and Spinczyk O..

System Support for Privacy-preserving and Energy-efﬁcient Data Gathering in the Global Smartphone Network - Opportunities and Challenges.

DOI: 10.5220/0004339200800085

In Proceedings of the 3rd International Conference on Pervasive Embedded Computing and Communication Systems (PECCS-2013), pages 80-85

ISBN: 978-989-8565-43-3

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

matter of trust, transparency, simplicity, and incen-

tives. We will revisit these points throughout this pa-

per.

The key enabler for our vision is a powerful sys-

tem software that solves the aforementioned prob-

lems. Section 3 will present a sketch of its architec-

ture, while Section 4 will discuss related work and

essential future directions of research that is needed

for the solution and its optimization. If we assume

that the necessary system support was available, two

new classes of smartphone applications could be de-

veloped, which potentially have huge market impact:

Internet Queries. Companies, developers, and

communities could formulate and run queries as if

the smartphone network was their own huge sensor

network. Of course, end-user privacy protection does

not allow to access personal information, such as in-

dividual location traces, but in most cases that is not

relevant for large-scale data mining anyway. As an

example consider a query that determines the global

market share of smartphone hardware vendors. This

information is relevant for various stakeholders who

would clearly accept to pay for it and thus compen-

sate the resources consumed by the query on the data

providers’ phones.

Collective Apps. A new class of apps might appear

in the markets that do not only process data locally,

but analyze the data from all users on central servers.

As an incentiveto provide their data, users will beneﬁt

from the results of the analysis. For example, a hotel

ﬁnding app could collect hotel visit statistics from its

users. Based on this data, hotels visited more than

once by the same user could be marked, because this

might indicate customer satisfaction.

The two examples mentioned above show what in-

centives we have envisioned for data providers and

data consumers. They also show that privacy pro-

tection is crucial for the success of these application

classes. We assume that a data provider will trust his

hardware and the system software, but not the vari-

ous data consumers. However, as long as an Inter-

net Query or Collective App does not consume too

many resources–especially battery power–and as long

as it does not transmit critical data, end users are not

harmed by executing it on their phone. However,

transparency with respect to energy consumption and

to the criticality of transmitted data is an important

platform requirement. For the sake of simplicity, data

providers shall be able to conﬁgure their privacy re-

quirements, i.e. the amount of data they are willing to

provide, as simple rules that can be used by the plat-

form in order to accept or reject queries automatically

on behalf of the user. This reliefs the data provider

from the burden to understand and accept various dif-

ferent “terms and conditions” with individual privacy

regulations.

2 STATE OF THE ART

Today most apps provide only local services (e.g.

games) or solely use the Internet to provide users with

a convenient interface to access centrally stored data.

Collective Apps and Internet Queries use real-time or

historical data from a community of mobile users in

order to offer improved services or for data mining

purposes. In this section we present two examples

that can be regarded as prototypes of Collective Apps

and Internet Query applications. Based on these ex-

amples we revisit the concerns about privacy, trans-

parency, simplicity, and the lack of incentives that we

have mentioned in the previous section from the data

provider’s and consumer’s points of view.

Google Maps

. is a web-based map service appli-

cation and technology, which gives the mobile user

the ability to look up his current position on a map of

the environment and to use navigation services. The

location information is determined using one of the

positioning technologies installed on the device such

as GPS, cellular identiﬁcation, or even its IP address.

By transmitting their location data, users can get the

possible routes to their desired destination. Google

Maps collects user’s data in order estimate how much

time the user needs to reach his destination. Data from

other smartphones is used to evaluate the current and

future trafﬁc situation. Therefore, Google Maps can

be considered as an early Collective App.

Cambridge Device Analyzer

. is an example for

the ﬁrst prototypes for Internet Queries, whose pur-

pose is research. It collects usage statistics from

geographically distributed devices for statistical pur-

poses. Data from all participants are aggregated at a

central server. This data will then be ﬁltered and ex-

amined in order to extract useful information from it.

Location information is collected by sending the ID

of the GSM cell to which the phone is connected.

2.1 Trust

Currently, smartphone users either have to accept that

an app will have access to certain sensors and system

http://en.wikipedia.org/wiki/Google Maps

http://deviceanalyzer.cl.cam.ac.uk/

SystemSupportforPrivacy-preservingandEnergy-efficientDataGatheringintheGlobalSmartphoneNetwork-

OpportunitiesandChallenges

state or they will not be allowed to use it. There-

fore, many people just accept, even though a bad

gut feeling remains, because they do not really trust

the app developers. Users especially have no inﬂu-

ence on the data that will be transmitted to servers

in the Internet. Other users just do not accept. This

gives big (more “trustworthy”) companies an advan-

tage over small companies or research projects: Based

on the latest statistical expectations by market ana-

lysts, Google and Apple, combined, will capture 98%

of the worldwide mobile market by the end of 2012.

Consequently, this is no problem for Google Maps,

but an important issue for, e.g., the Cambridge De-

vice Analyzer.

The Collective Apps that we are aware of typically

assume that a trusted centralized server or coordina-

tor exists, which collects data from all devices in one

database and anonymizes it before use. However, cen-

tral aggregation and anonymization of data is prob-

lematic, because it increases the number of parties a

user has to trust: For example, the server might be lo-

cated in an untrusted cloud or the communication link

could be not properly secured.

Many proposals have been suggested to avoid this

problem, such as introducing a trusted proxy as a mid-

dle layer between the data sources and the data con-

sumer, which anonymize all data transferred. Yet,

eventually all this boils down to the simple fact that,

as soon as the data has left the smartphone, the user

no longer has control over his data and the analyses

performed on it.

2.2 Transparency

Another cause of privacy concerns, which is related to

trust, is the lack of adequate control over the disclo-

sure of real-time personal information. Most of the

existing location-sharing apps do not detail their poli-

cies for the collection and use of personal informa-

tion. Likewise, Google’s privacy policy ensures that

user information is shared across Google’s network of

sites, which means that the participants’ data is being

collected and processed. It is not yet clear which data

is actually being transmitted, since the whole process

is obscured. In the case of Apple, two developershave

recently discovered that the iPhone has been regularly

recording the device’s location since the introduction

of iOS 4. There is not enough transparency of data

processing in today’s smartphone applications.

A second important transparency issue is the re-

source consumption of the device while collecting

data, especially with respect to battery power. Most

Collective Apps in use today do not take any con-

sideration for conserving the data providers’ battery

power and do not provide an estimate of the overall

resource consumption the user has to expect. For ex-

ample, frequent location updates from and to the mo-

bile deviceconsume a signiﬁcant amount of its battery

power. Users will not accept Collective Apps if they

have to live with unexpected battery drain.

2.3 Simplicity

Most Internet Queries and Collective Apps today ask

the participant to accept the terms of data processing

and anonymization. However, these documents are

usually long and complicated. Reading the individ-

ual statements is time consuming, tedious, and dif-

ﬁcult for non-experts in law and privacy-preserving

data mining. Google’s new (simpliﬁed) privacypolicy

has about 2300 words, not including product-speciﬁc

regulations. The description of the Device Analyzer

has “only” about 800 words. It should be clear that

dealing with each Collective App separately does not

scale.

2.4 Incentives

Some location-based apps, like Google Maps, receive

position information from its participants while pro-

viding them with the desired navigation service in re-

turn. Other apps, e.g. the Cambridge Device analyzer,

collect data solely for research purposes, in which

the data provider might not be interested. Thus, it

provides no direct incentive for the data providers to

share their information. Therefore, only enthusiasts

participate in this effort.

3 PLATFORM PROPOSAL

Having identiﬁed the roadblocks for Collective Apps

and Internet Queries in the current state of the art, we

will now come up with a set of design principles we

deem suited to address those problems. With these

principles in mind, we will then proceed to sketch

how the respective queries may look like and coarsely

describe the envisioned platform’s architecture.

3.1 Design Principles

Simple Absolute Control. Transparency is not

only about knowing what applications are doing with

a provider’s device and data, but also incorporates

control. Obviously, this comprises the possibility of

opting out of data collection temporarily or perma-

nently at any time. However, providers should also

be able to declare in a more ﬁne-grained manner how

PECCS2013-InternationalConferenceonPervasiveandEmbeddedComputingandCommunicationSystems

much they are willing to give. Regarding energy con-

sumption, this means control over the additional bat-

tery drain that is caused by data collection. With re-

spect to privacy, a provider should be able to declare

the extent of potentially privacy-critical data he is

willing to disclose. Deﬁning a list of non-disclosable

attributes or even more sophisticated rules should of

course be possible. But regarding the need for sim-

plicity, we need a means to quantify the potential pri-

vacy threat that stems from collected data.

Device-based Privacy. Because we explicitly want

to avoid the need for signing a contract with every

possible data consumer, we have to assume that data

transmitted from the device is accessible for eve-

ryone, forever. Although this sounds like a horri-

ble scenario, we believe that it is actually not too

far away from the current state. Designing our on-

device platform with this assumption in mind, we

are completelyindependentof any external infrastruc-

ture with respect to privacy. Thus, the data provider

only needs to trust his own device, which simpliﬁes

the trust requirement. This, of course, makes classi-

cal anonymization procedures that are based on data

from many individuals more difﬁcult if not impos-

sible at all. However, many envisioned queries are

still possible, since non-critical data like the battery

charge level can also provide important insights if col-

lected on such a large scale. Besides that, research on

privacy-preserving data mining techniques like per-

turbation (Agrawal and Haritsa, 2005) shows that it

is indeed possible to get aggregate values of sensitive

data without abandoning individual privacy.

Flexible Privacy. It is widely accepted that there is

no general method for privacy-preserving data pub-

lishing that also preserves the utility of the data. Gen-

erally spoken, privacy and utility is only possible if

you already know the analysis procedure for the data.

Thus, anyone wishing to issue an Internet Query or

writing a Collective App should be able to specify the

privacy-preservingprocedures himself in orderto also

preserve data utility. To satisfy the need for trans-

parency, the query has to be analyzable with respect

to the quantiﬁcation of the potential privacy threat.

Modiﬁed Stream Semantics. Most work on

privacy-preserving data mining focuses on static,

existing, data. The continuous stream of data from

smartphones poses an additional challenge. Existing

work on privacy-preserving stream mining considers

vertically separated data streams that would have

to be joined to gain access to the whole range of

attributes. In our case, however, we have horizontally

Figure 2: A query example: counting hotel visits.

separated data. In the classic database-oriented view,

all data ever produced by a smartphone would form

a row in the database, with billions of attributes,

smashing any attempt of privacy preservation with

the curse of dimensionality, impeding anonymization

(Aggarwal, 2005) and perturbation(Aggarwal, 2007).

Therefore, the data from one device must never be

perceived as one complete stream. Instead, only

small chunks of the provider’s data may be perceived

as a continuous data stream.

Lazy Transmission. Various work has shown that

it is generally beneﬁcial to transfer mobile data rather

infrequently and in larger bursts. It is even better to

wait for more energy-efﬁcient connections or situa-

tions, where the device is plugged into an AC connec-

tor. However, if queries are time-critical, intelligent

scheduling is needed for trading off power consump-

tion against latency.

3.2 Query Structure

As stated in the previous section, data consumers

should be able to ﬂexibly specify the method used to

enhance the privacy of transmitted data. Since we

cannot expect providers to inspect every query for

potential threats to their privacy, we need to quan-

tify the criticality and identiﬁability of the generated

data. Thus a query may only use a deﬁned set of data

sources and processing operators with known seman-

tics. Additionally, the data ﬂow from those sources

through different operators has to be speciﬁed in a

way that is automatically analyzable.

While there are many possible representations, we

use a data ﬂow graph to explain the query structure.

The query for the Collective App example from the

introduction, the hotel visit count, is depicted in Fig-

ure 2. Some operators are designed to enhance the

privacy compared to the raw data. The red ﬁgures af-

ter each processing step are meant to illustrate how

privacy quantiﬁcation might look like in the future;

currently, they are of course just pure ﬁction.

Our proposed query language consists of basic op-

erators that may be chained together:

Aggregators. These operators compute an aggregate

value of a time-dependent discretely sampled in-

put value. There are simple ones, like the comput-

ing the average, and advanced ones, like an op-

SystemSupportforPrivacy-preservingandEnergy-efficientDataGatheringintheGlobalSmartphoneNetwork-

OpportunitiesandChallenges

erator that counts occurrences of discrete values

(hotel names, in our example). Temporal aggre-

gators have to be followed by the fetch operator,

which determines the points in time where aggre-

gate results are actually produced (in the example,

every 3 months; the computation might of course

be done continuously). Thus, two or more aggre-

gators can be synchronized. The purpose of ag-

gregators is to lower the criticality as well as the

volume of the generated data.

Filters. A ﬁlter’s purpose is to reduce the data set by

removing attribute values which fulﬁll a certain

condition.

Perturbators. Perturbators distort attribute values

by, e.g., adding random noise. They reduce the

criticality by rendering values unusable with re-

spect to a speciﬁc device, but still allow to com-

pute aggregate values for many devices.

Interpreters. An interpreter enriches data semanti-

cally by using publicly available information, and

is always something that could be done also at

the consumer’s site. However, the resulting data

might offer a better trade-off between utility and

privacy and should thus be part of the standard-

ized processing stage. In our example, the time-

location pair is highly critical and would thus have

to be ﬁltered or perturbed, which renders it unus-

able for hotel rating. However, if we use an inter-

preter to transform the numeric location into a se-

mantic location (e.g., via a geocoder that allows

reverse geocoding), ﬁlter everything besides ho-

tels, and use a counting aggregator to record visit

counts for every hotel, we can preserve data util-

ity, while drastically reducing its criticality.

This list is certainly not complete. Besides these oper-

ators, we could also think of mathematical (stateless)

functions, which might also inﬂuence the criticality.

Everything that goes into the transmit node in

Figure 2, is sent to interested consumers, together

with a timestamp. An ID for the data stream is

needed, when individual patterns have to be moni-

tored. However, in adherence to our modiﬁed stream

semantics principle, the ID is generated randomly and

not a constant value. In fact, the change frequency

of an ID (in our example: one year) has to be con-

sidered during criticality assessment. We have, of

course, to assume that there is a data transport mech-

anism that does not allow linking two stream chunks

with different IDs to the same device. This, however,

is our only assumption regarding the external infras-

tructure’s trustworthiness.

3.3 Architecture Sketch

Collective Apps, as well as general consumers is-

sue queries as described in the previous section. As

depicted in Figure 3, queries are subject to privacy

checks before actually executed. Queries make use of

data sources and operators. Although it is not within

the scope of this paper, we see energy-efﬁcient sensor

management, e.g., for location sensing (Zhuang et al.,

2010) as a crucial ingredient here. Some operators or

data sources might access publicly available knowl-

edge, like the geocoder from our example. Both the

generated data as well as requests for supplementary

data are subject to energy-efﬁcient lazy transmission

scheduling, like it is done in (Ra et al., 2010).

Keeping the privacy issues inside the device, the

off-device infrastructure’s main challenge is to ensure

fairness between all data consumers. It has to in-

telligently distribute queries to devices all over the

planet, keeping track of them and transporting the

generated data back to the consumers. Additionally

similar queries should be combined to reduce the mo-

bile data transfer volume and thus the individual en-

ergy consumption, allowing every consumer to query

a larger set of devices. Due to the large number of

devices, consumers and queries, the common infras-

tructure itself has to be a distributed system, which

brings its own challenges.

4 RESEARCH DIRECTIONS

To wrap it up, we present the most crucial research

challenges in the pursuit of this vision.

Power models for mobile devices are necessary to

assign energy costs to queries (transparency and con-

trol). Although much work has been done to model

power consumption of hardware components (Zhang

et al., 2010), the cross-layer nature of data collection

yet poses a challenge.

Aggregation and Perturbation are well known

concepts in the data mining community and are

amenable for distributed execution. (Agrawal and

Haritsa, 2005) However, most research is focused

globally on static databases instead of locally on the

data generation site.

Although there is work on privacy quantiﬁcation

(Venkatasubramanian, 2008; Agrawal and Aggarwal,

2001) for selected privacy-preserving data mining

techniques, a comprehensive approach that allows to

quantify chains of operators does not yet exist. Fur-

thermore, the resulting measures often depend on the

concrete data. Another approach would be to come up

with a model for data- and operator-speciﬁc privacy

PECCS2013-InternationalConferenceonPervasiveandEmbeddedComputingandCommunicationSystems

Figure 3: The proposed architecture.

policies. These policies would impose rules on the

selection of data sources, operators and their parame-

ters. Then, instead of deriving a privacy level from a

query, one would map privacy levels to policies.

5 CONCLUSIONS

AND OUTLOOK

In this paper we have presented the vision of a

privacy-preserving and energy-efﬁcient platform for

smartphone applications from an infrastructure point

of view. The platform would exploit theoretical re-

search results from the privacy-preserving data min-

ing community for a very practical goal: An open

global smartphone sensor network. The platform

would enable two novel application classes, namely

Internet Queries and Collective Apps. This might

have a huge impact on smartphone use, for both, the

data providers and data consumers.

The motivation for this work was practical expe-

rience with data collection tasks on smartphones and

the concerns of users we have faced. Our goal is to

perform a step-wise transformation of our own data

gathering infrastructure into the sketched platform. In

parallel we aim at reﬁning the described design. In or-

der to write a self-contained short paper, we have fo-

cused on the smartphone-side of the architecture here.

However, the development of the distributed “com-

mon infrastructure”, which globally processes, opti-

mizes, and dispatches queries, is not less challenging.

ACKNOWLEDGEMENTS

Part of the work on this paper has been supported by

Deutsche Forschungsgemeinschaft (DFG) within the

Collaborative Research Center SFB 876, project A1.

REFERENCES

Aggarwal, C. (2005). On k-anonymity and the curse of

dimensionality. In 31st int. conf. on Very large data

bases, pages 901–909. VLDB Endowment.

Aggarwal, C. (2007). On randomization, public information

and the curse of dimensionality. In IEEE 23rd Int.

Conf. on Data Engineeering, pages 136–145. IEEE.

Agrawal, D. and Aggarwal, C. (2001). On the design and

quantiﬁcation of privacy preserving data mining al-

gorithms. In 20th ACM SIGMOD-SIGACT-SIGART

symp. on Principles of database systems, pages 247–

255. ACM.

Agrawal, S. and Haritsa, J. (2005). A framework for high-

accuracy privacy-preserving mining. In 21st Int. Conf.

on Data Engineering, pages 193–204. IEEE.

Ra, M., Paek, J., Sharma, A., Govindan, R., Krieger, M.,

and Neely, M. (2010). Energy-delay tradeoffs in

smartphone applications. In 8th int. conf. on Mobile

systems, applications, and services, pages 255–270.

ACM.

Venkatasubramanian, S. (2008). Measures of anonymity.

In Aggarwal, C. C., Yu, P. S., and Elmagarmid, A. K.,

editors, Privacy-Preserving Data Mining, volume 34

of Advances in Database Systems, pages 81–103.

Springer US. 10.1007/978-0-387-70992-5 4.

Weiser, M. (1991). The computer for the 21st centrury. Sci-

entiﬁc American, 265(3):94–104.

Zhang, L., Tiwana, B., Qian, Z., Wang, Z., Dick, R.,

Mao, Z., and Yang, L. (2010). Accurate on-

line power estimation and automatic battery behavior

based power model generation for smartphones. In

8th IEEE/ACM/IFIP Int. Conf. on Hardware/software

codesign and system synthesis, pages 105–114. ACM.

Zhuang, Z., Kim, K., and Singh, J. (2010). Improving en-

ergy efﬁciency of location sensing on smartphones. In

8th int. conf. on Mobile systems, applications, and ser-

vices, pages 315–330. ACM.

SystemSupportforPrivacy-preservingandEnergy-efficientDataGatheringintheGlobalSmartphoneNetwork-

OpportunitiesandChallenges