“No More Spray and Pray Audience Targeting in Mobile World”
IAB based Classification Approach for Mobile App Audience Measurement
Kajanan Sangaralingam
Department of Information Systems, School of Computing, National University of Singapore, 13 Computing Drive,
Group Digital Life, Singapore Telecommunications, Singapore, Singapore
1 STAGE OF THE RESEARCH
The specific problem this research addresses is how programmatic media buying (IAB 2013; Ebbert 2012) can help in designing effective mobile app advertisement campaigns. In particular, it is proposed that ad campaigns which target mobile users via mobile applications would be more effective if there were a way to determine the audience information of the mobile apps from which mobile ad requests are generated. To address this problem, this research proposes a dynamic approach that can effectively measure the audience demographics of the millions of existing mobile applications as well as newly arriving applications. The research involves four main steps. The first step is to generate the top n categories for a given app A and estimate a set of app audience demographic properties (Age, Education and Has Children) based on the "category-audience demographic" mapping. The second step proposes a way to predict the gender demographics of mobile app users. In the third step, the accuracy of the predicted audience demographic values is evaluated. In the fourth step, the efficacy of using audience demographics data in a real mobile app ad campaign will be evaluated. The research has progressed through step 3; step 4 remains.
A preliminary experiment with the proposed approach yielded satisfactory results. Since most ad requests do not contain relevant audience information, the proposed approach can be used as a third-party platform to attach this data to ad requests. Given the popularity and usefulness of mobile apps, studies of this nature can greatly help many constituents of the app ecosystem.
As a next step (Step 4), the effectiveness of incorporating the proposed programmatic buying framework into a real-time mobile advertisement campaign will be evaluated. For example, suppose an advertiser wants to design an ad campaign for a newly designed female fashion outfit; the framework proposed in this study can then be utilized in several ways. First, using the audience profile of each app, advertisers can identify the apps that are mostly used by female users. Then the most popular of these apps at a given time t can be identified by deriving the overall popularity of each app at time t. Subsequently, the ad campaign can be delivered through the set of identified apps.
To measure the effectiveness of this programmatic buying framework in an ad campaign, several experiments will be designed. In the first experiment, advertisements will be targeted using app audience profiles only; in the second, the same campaign will be targeted using popularity signals only; and in the third, the campaign will target apps that are popular at a given time t and are used mostly by female users. The practical efficacy of the programmatic buying framework will be validated against direct ad serving (i.e. non-programmatic buying) and against programmatic buying without the additional information such as app audience profiles and app popularity signals. In this way, the effectiveness of each approach, and of the combined approach, can be estimated in an ad campaign setting at a given time t.
2 OUTLINE OF OBJECTIVES
Mobile apps are increasingly popular in various
markets across the globe. The total number of apps
in the mobile app market and their rate of growth are
remarkable. An average user spends 10% of their media attention on their smartphones and tablets. Further, Flurry reported that between December 2011 and December 2012 the average time spent on smartphones by a US consumer increased from 94 minutes to 127 minutes (i.e. by 35%) (Simon 2013), while the average time spent on the web decreased by 2.4%
(i.e. from 72 minutes to 70 minutes). On average, US consumers spend 1.8 times more time on apps than on the web (Simon 2013). Statistics indicate that roughly 224 million people use mobile apps on a monthly basis, compared to 221 million desktop users; i.e. mobile app users slightly outnumber desktop users (Mary 2013b). Moreover, it has been observed that mobile devices became the first screen and TV the second screen during the recent Super Bowl (Mary 2013a). This indicates that brand owners need to concentrate more on mobile advertising in order to reach more customers. Thus, mobile apps have become a lucrative medium with a growing customer base and promising revenue.
With the growing customer base, understanding
the audience properties is crucial to yield greater
business value for mobile advertisers. However,
audience tracking is far more difficult in the mobile context. Commercial audience measurement agencies such as Nielsen, comScore and Quantcast determine the audience characteristics of media (such as print, radio, TV and the internet) often using panel based approaches. In this approach, a set of users with known demographic information is recruited, and their behavior is captured either by survey or by instrumenting their gateway devices (cable box and browser). The demographic attributes of these users are then extrapolated to the wider audience. In addition, behavioral weights are used to correct for potential biases in the recruited panel. This approach leads to reliable audience estimates because the popularity of TV shows and websites persists for quite a long time, so the real-time collection of demographics for TV shows and websites is less of an issue. For example, a popular website such as CNN.com is unlikely to be wiped off the map in 60 days; similarly, a popular TV show such as American Idol is likely to remain popular for at least 90 days. In other words, popular websites and regular TV shows hardly demonstrate churn.
However, unlike the traditional media (such as TV
and web), mobile app popularities are highly
transient. Table 1 illustrates the top 5 popular apps
based on their store ranks on 1st of May and 1st of June 2013 in the United States, under the Games category.
As can be seen from Table 1, the paid and free apps that were popular on 1st of May were no longer popular on 1st of June (i.e. within a one-month/30-day period). Thus we can infer that mobile app popularities are not persistent. Considering the top 100 apps, on average 46% churn over 30 days and 85% churn over 90 days (Farago 2012).
Table 1: Top 5 Games Apps in US Store (iPhone, US Games category).

No.  1st May 2013 (Free)             1st May 2013 (Paid)                        1st June 2013 (Free)         1st June 2013 (Paid)
1    Robot Unicorn Attack 2          Survivalcraft                              Dumb Ways to Die             Heads Up!
2    Draw Something 2™ Free          Cut the Rope: Time Travel                  Candy Crush Saga             Bloons TD 5
3    PAC-MAN DASH!                   Minecraft – Pocket Edition                 Tetris® Blitz                Block Fortress
4    Iron Man 3 - The Official Game  Draw Something 2™                          Snoopy Coaster               Plague Inc.
5    Whats The Movie?                Teenage Mutant Ninja Turtles: Rooftop Run  Fast & Furious 6: The Game   Kick the Buddy: No Mercy
Interestingly, the churn rates of games and lifestyle apps are extremely high (80%-90%). If one wanted to apply panel based measurement in this scenario, the panel based data collection would need to happen almost every week, or even every day, to obtain an accurate measurement, which is practically impossible to carry out. In summary, app popularities are highly volatile and transient in nature; therefore, traditional panel based techniques cannot be used to measure the app audience.
3 RESEARCH PROBLEM
This study therefore aims to resolve this challenge by proposing a reliable, non-panel based scientific technique. A hybrid machine learning approach based on classification and prediction is proposed. In the classification step, each app is assigned to one or multiple fine grained classes. Based on the classes to which the app belongs, the demographics (Age, Has Children and Education) are assigned to the app, while the app's gender distribution is obtained using the prediction approach.
The proposed hybrid approach has several
advantages compared to the traditional panel based
approach. First, the approach scales with the increasing number of mobile apps (currently 1.4 million in the Android Play store and Apple iTunes store combined). Second, the audience demographics of new apps can be computed instantly as the apps get
ICSOFT2014-DoctoralConsortium
4
added to the app store and become popular, without
waiting for the panel to be recruited.
4 STATE OF THE ART
This section briefly discusses the literature and methodologies related to audience measurement strategies.
4.1 Audience Measurement
Prior research has studied demographic attribute prediction using users' web usage patterns. These studies have used the content of websites (Kabbur et al. 2010) and various types of internet user statistics, such as web page click-through data (Hu et al. 2007) and search terms (Murray & Durrell 2000; Zhang et al. 2006), to predict user demographic attributes. Adar (2007) predicted the demographic information of online audiences using vector comparison (known vs. unknown users) and a bias value for web pages. Hu et al. (2007) used several methods, including a Bayesian classification model, similarity between users, and multiple classifiers, to predict demographic attributes of users. Murray and Durrell (2000) analyzed the search terms entered and web pages accessed by users and predicted user demographic attributes using Latent Semantic Analysis (LSA).
In practice, cookies are commonly used to gather long-term data on individual browsing histories. A cookie is a piece of text sent from a website and stored in the user's web browser while the user is browsing that website. When the user browses the same website again in the future, the cookie is sent back to the website to notify it of the user's previous activity. Despite their popularity, cookies are often criticized over privacy concerns (Mayer-Schönberger 1998).
Besides, the internet marketing research agency comScore measures the web audience using a tag that is propagated throughout the website to be tracked, which in turn measures traffic, page views and other related information. To measure audience attributes, comScore regularly maintains around two million panelists who have installed background monitoring software that tracks their online behavior. In addition, a series of weight adjustments is carried out to generate accurate US or global web demographics. comScore describes this as follows: "Demographic information is gathered from our panel. When someone opts into the comScore panel, they are required to fill out a short questionnaire where we gather demographic information for themselves as well as other people in the HHLD who will be using the metered computer. We then use census population estimates to project out to the total internet population".
Similarly, Quantcast, a web analytics service, measures web audience statistics by allowing registered sites to run its data collection feeds, web beacons and anonymous cookies to track the online behavior of web users. Based on the online behavior of each user, Quantcast builds a profile of that person's browsing habits and hence extrapolates demographics.
The literature on user demographic prediction provides the basic state-of-the-art methodologies for audience estimation. However, the approaches used in the literature cannot be utilized in this context for several reasons. First, as discussed earlier, due to the changing popularity of apps and the constant addition of new apps, the panel based approach will not work for mobile apps. Second, the number of apps is so large (1.4 million for the Android Play Store and the iTunes store combined) that recruiting panels for measuring demographics is an impossible task. Third, similar to cookies, mobile app based tracking mechanisms such as the Safari flip-flop, HTML5 first-party cookies and the UDID (unique device identifier) have also been criticized over privacy concerns, and apps using these tracking tools have been rejected by platform owners. Therefore, this study looks into a non-panel based technique that does not invade the privacy of users.
5 METHODOLOGY
5.1 Intuition
Audience demographics are the quantifiable measures of a given population. Audience demographic data are widely used in public opinion polling, marketing and advertising. Generally, the demographic data of a person include gender, age, ethnicity, income, language and even location. Precise estimation of audience demographics can help in targeting the right audience through media (such as web, mobile, TV, radio, etc.). The Interactive Advertising Bureau (IAB) (IAB 2011), an organization that develops industry standards for advertisements, has proposed a standardized taxonomy for classifying mobile apps, based on advice received from taxonomy experts. This IAB taxonomy has 23 broad categories in Tier-1, 371 sub-categories in Tier-2 and an open-ended number of categories in Tier-3. Table 2 shows
"NoMoreSprayandPrayAudienceTargetinginMobileWorld"-IABbasedClassificationApproachforMobileApp
AudienceMeasurement
5
some of the IAB's Tier-1 to Tier-2 category mappings. In this study, some of the audience properties of mobile apps are measured in two ways. First, apps are classified into the IAB-defined Tier-2 categories, and the specific audience properties ("Age", "Has Children" and "Education") of each app are then derived using the category-audience mapping. Second, the gender distribution of each app is predicted using a machine learning approach. Both approaches are detailed below.
Ideally, the first goal is to generate the top n categories for a given app A and estimate the set of app audience properties based on the "category-audience demographic" mapping. For example, consider estimating the audience of the iTunes app 'Brides', which is placed under the 'Lifestyle' category in the Apple iTunes store. First, the app ('Brides') is classified into a set of IAB Tier-2 categories. In this case, the app will be classified into IAB categories such as 'Society: Weddings', 'Society: Marriage', 'Style & Fashion: Beauty', 'Style & Fashion: Fashion' and 'Hobbies and Interests: Photography'. In addition to the classification, a class membership score is obtained for the app in each of these categories. For example, 'Brides' might receive a score of 0.4 for the 'Society: Weddings' category, 0.3 for 'Society: Marriage', 0.2 for 'Style & Fashion: Beauty', 0.05 for 'Style & Fashion: Fashion' and 0.05 for 'Hobbies and Interests: Photography'. The second step involves creating the demographics for each of the IAB Tier-2 categories. For this, a set of apps whose demographics are well known is first identified in each category. For example, it is known that slot machine apps are used mostly by older female users, so similar demographics can be assigned to the category related to slot machines. There are several ways to obtain the demographics of such an app; these apps are called reference apps. Having identified multiple such reference apps and their corresponding demographics for a given category, the demographics of the reference apps are consolidated and the overall demographics of the category are derived. Thus, for the app 'Brides', using the relevant category memberships, the audience demographics would be estimated as age = '20-35', education = 'grad school & above', having children = 'no'.
The second goal is, for a given app A, to predict its gender distribution. For example, for the app 'Brides' mentioned above, the relevant gender distribution would be 20-80, meaning that 20% of users are estimated to be male and the remaining 80% female. It has been observed that deriving the gender distribution of an app using the classification approach discussed above did not yield satisfactory results. Thus, it is proposed that predicting gender with a machine learning approach can achieve better accuracy.
Table 2: IAB Tier-1 to Tier-2 mapping.

Tier-1              Tier-2
Business            Advertising, Agriculture, Construction, Government, Human Resources, Marketing
Family & Parenting  Adoption, Babies and Toddlers, Day care/Pre School, Family Internet, Pregnancy, Special Needs Kids
Sports              Auto Racing, Baseball, Bicycling, Cricket, Football, Inline Skating, Olympics, Swimming
Society             Dating, Divorce Support, Gay Life, Marriage, Senior Living, Teens, Weddings, Ethnic Specific
5.2 Solution Details
Having described the intuition and the high-level approach, the details of the solution are described next. The solution has four major components: (1) category-demographic mapping, (2) app classification, (3) audience measurement (Age, Education and Has Children) and (4) gender prediction. Each is described in detail below.
5.2.1 Category Demographic Mapping
As described before, the IAB Tier-2 categories are relied on for the demographic identification of an app. One of the important steps in the approach is determining the demographics of each IAB Tier-2 category. For this purpose, a set of reference apps was identified for each IAB category. Reference apps are apps that have corresponding websites or Facebook fan pages whose audience demographics are known. For example, for the IAB Tier-2 category "Travel: Hotels", apps such as "Hotels.com", "Travelocity - Book Hotels, Flights & Cars" and "Kayak" were identified, which have the respective sister websites hotels.com, travelocity.com and kayak.com. In addition, these apps have their respective Facebook pages as well (e.g. www.facebook.com/travelocity). In the proposed approach to demographic identification of mobile apps, it is assumed that the user demographics of a mobile app are approximately similar to the user demographics of its corresponding sister website or relevant social media pages (e.g. Facebook fan pages). This assumption has been validated by calculating the semantic similarity between the content of reference app websites and the descriptions of randomly selected apps in each category. Using this assumption, the demographics of the sister websites, obtained from known sources (Alexa 2012; Quantcast 2013), and of the Facebook fan pages were combined to derive the demographics of each reference app. Next, using the demographics of the reference apps in each IAB category, the demographics of each IAB Tier-2 category were derived. As an example, Table 3 shows the demographics of the "Society: Weddings" category derived using this approach.
Table 3: IAB Category Demographic Mapping.

Tier-2 Category: Society: Weddings
  Age:          {Child: 0%, Teen: 10%, GenY: 75%, Middle Age: 10%, Old: 5%}
  Gender:       {Male: 30%, Female: 70%}
  Has Children: {Yes: 15%, No: 85%}
  Education:    {No College: 5%, College: 30%, Grad School: 65%}
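The consolidation of reference-app demographics into a category profile like the one in Table 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the simple averaging, the helper names and the numeric shares for the "Travel: Hotels" reference apps are assumptions.

```python
# Minimal sketch (not the authors' code): consolidating reference-app
# demographics into an IAB Tier-2 category profile by simple averaging.
# The reference apps and their numbers below are illustrative placeholders.

def consolidate_category(reference_apps):
    """Average the demographic distributions of a category's reference apps.

    reference_apps: list of dicts, each mapping a dimension (e.g. "Gender")
    to a distribution dict (e.g. {"Male": 0.3, "Female": 0.7}).
    """
    category_profile = {}
    for app in reference_apps:
        for dimension, distribution in app.items():
            bucket = category_profile.setdefault(dimension, {})
            for label, share in distribution.items():
                bucket[label] = bucket.get(label, 0.0) + share
    # Normalise each dimension so the shares sum to 1 again.
    for dimension, distribution in category_profile.items():
        total = sum(distribution.values())
        for label in distribution:
            distribution[label] /= total
    return category_profile

# Hypothetical reference apps for "Travel: Hotels" (values are made up).
hotels_refs = [
    {"Gender": {"Male": 0.45, "Female": 0.55}},   # e.g. from hotels.com
    {"Gender": {"Male": 0.40, "Female": 0.60}},   # e.g. from travelocity.com
]
print(consolidate_category(hotels_refs))   # {'Gender': {'Male': ~0.425, 'Female': ~0.575}}
```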
5.2.2 App Classification
a. Data Pre-processing and Feature Extraction
Once the demographics for each IAB Tier-2 category have been identified, the demographic identification of an app involves identifying the best matching (in this case, the top 5) IAB Tier-2 categories to which the app can belong. This process classifies existing apps into the identified IAB categories. For this purpose, publicly available app descriptions were used as the main data source, and a supervised machine learning approach was followed for classification. To train the classifiers, approximately 25 apps per category were manually identified; hence, across all 371 IAB Tier-2 categories, 9205 apps were identified as the training set. All of these apps were validated twice by professional lexicographers for the accuracy of their categorization into the respective classes (or categories). For example, under the category "Home & Garden: Gardening", apps such as "Garden Insects", "Gardening and Landscape Guide" and "Vegetable Gardening Guide" were identified as training instances.
For each app in the training set, a set of features was extracted. Figure 1 shows the details of the feature extraction process, which includes several steps.

Figure 1: Feature Extraction Process.
Step 1: Each app description is first checked for special characters (see Table 4), which are removed if found; the description is then subjected to a language test. Several non-English apps were identified in the training samples. There were many country-specific IAB categories under the Tier-1 category "Travel", and some of the corresponding apps were not in English. For example, for the Tier-2 category "Travel: Saudi Arabia", apps such as "Riyadh Food" and "Al Tayyar Travel", whose store listings also contain Arabic text, were identified as training instances. As the app descriptions were written using both English and Arabic, a translator package (Bing Translator, Microsoft Corporation, 2011) was included to handle app descriptions containing languages other than English. All stop words are then removed from the app description (see Table 4).
Table 4: Stop words & Special Characters.

Special Characters       》《 ㋡ _ ✹ ★ ◆ ✦ ✔ ® # * – ' , à è : ) ñ é @ $ % & ( ó / . ! ; - = ü ä ? 【 】 「 」
Stop words (only some)   "a", "about", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already"
"NoMoreSprayandPrayAudienceTargetinginMobileWorld"-IABbasedClassificationApproachforMobileApp
AudienceMeasurement
7
Step 2: In step 2, the processed description is subjected to part-of-speech tagging and lemmatization. The Stanford part-of-speech tagger (Toutanova et al. 2003) is used to attach a part-of-speech tag to each token (i.e. word) in the app description. More precisely, the app description is split into sentences, which are then processed by the part-of-speech tagger. When supplied with a sentence, the tagger produces an ordered list of part-of-speech tags, one for each word in the sentence (noun, verb, adjective, etc.). For example, the app "Beer Calculator" had the following sentence in its description: "By now we all know that alcohol is bad for you, yet most of will still go out to have a beer". When this sentence is run through the part-of-speech tagger, the word 'By' is tagged as a preposition, 'now' as an adverb, 'we' as a personal pronoun, 'all' as a determiner, and so on. The overall tagging result is: By/IN now/RB we/PRP all/DT know/VBP that/IN alcohol/NN is/VBZ bad/JJ for/IN you/PRP ,/, yet/RB most/JJS of/IN will/MD still/RB go/VB out/RP to/TO have/VB a/DT beer/NN, where IN, RB, PRP, DT, VBP, NN, VBZ, JJ and MD stand for preposition, adverb, personal pronoun, determiner, verb (non-3rd person singular present), noun, verb (3rd person singular present), adjective and modal respectively. Once the descriptions are tagged, only the verbs, adverbs and nouns are extracted as the initial features. The extracted features are then lemmatized in order to obtain the root form (e.g. "running" is lemmatized to "run") of each extracted token.
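The following sketch illustrates Step 2 using NLTK's tagger and WordNet lemmatizer as stand-ins for the Stanford tagger used in the paper, keeping only noun, verb and adverb tokens as described.

```python
# Sketch of Step 2 with NLTK instead of the Stanford tagger: tag each token,
# keep nouns/verbs/adverbs, then lemmatize the kept tokens.
import nltk
from nltk.stem import WordNetLemmatizer

# One-off downloads: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("wordnet")

KEEP_TAGS = ("NN", "VB", "RB")            # noun, verb and adverb tag prefixes
lemmatizer = WordNetLemmatizer()

def extract_candidate_features(description):
    features = []
    for sentence in nltk.sent_tokenize(description):
        for token, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            if tag.startswith(KEEP_TAGS):
                pos = "v" if tag.startswith("VB") else "n"   # crude POS hint for the lemmatizer
                features.append(lemmatizer.lemmatize(token.lower(), pos=pos))
    return features

print(extract_candidate_features("By now we all know that alcohol is bad for you."))
# e.g. ['now', 'know', 'alcohol', 'be']
```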
Step 3: Once the initial set of features has been extracted using the procedure above, it is checked against a master feature set. The master feature set is a bag of words containing words related to the app domain. The initial master feature set was created by lexicographers based on a bag of words (i.e. a dictionary) related to the app domain. To extend the master feature list, a corpus was created for each category by taking a sample of 100 apps per category; the high-frequency and high-idf (i.e. rare) tokens of each category (top 100 tokens) were then derived and added to the initial master feature list. If an extracted token appears in the master feature set, it is considered as one of the features for the given app. Thus, for each app selected for training, the features were extracted and kept in a file in the following format (see the sketch after the format line):
“<feature1> <feature2><feature 3>………………………… <feature_n>”.
Now that the features have been extracted, the next step is to build the classification model.
b. Building classification model
Multinomial Naïve Bayes, TF-IDF and Support Vector Machines are used as the initial classification approaches for classifying apps into the possible IAB Tier-2 categories. A brief introduction to each methodology is given below.
Naïve Bayes:
Since the training input is the pre-processed app description, a token-based naive Bayes classifier is used. It computes the joint probability of the tokens in an app description and a category by factoring the joint into the marginal probability of the category times the conditional probability of the tokens given the category:

$$p(\text{tokens}, \text{cat}) = p(\text{cat}) \cdot p(\text{tokens} \mid \text{cat})$$

Conditional probabilities of a category given the tokens are derived by applying Bayes's rule to invert the probability calculation:

$$p(\text{cat} \mid \text{tokens}) = \frac{p(\text{tokens}, \text{cat})}{p(\text{tokens})} = \frac{p(\text{cat}) \cdot p(\text{tokens} \mid \text{cat})}{p(\text{tokens})}$$

Since naive Bayes assumes that tokens are independent of each other (this is the "naive" step):

$$p(\text{tokens} \mid \text{cat}) = \prod_{i=1}^{n} p(\text{token}_i \mid \text{cat})$$

Then the marginal distribution of the tokens is computed by marginalizing over the categories:

$$p(\text{tokens}) = \sum_{\text{cat}'} p(\text{cat}') \cdot p(\text{tokens} \mid \text{cat}')$$

In addition, maximum a posteriori (MAP) estimates of the multinomial distributions are calculated for $p(\text{cat})$ over the set of categories and, for each category, for the multinomial distribution $p(\text{token} \mid \text{cat})$ over the set of tokens. Further, a Dirichlet conjugate prior for multinomials is employed, which is straightforward to compute by adding a fixed "prior count" to each count in the training data; this lends the traditional name "additive smoothing". After building the naive Bayes classifier, the extracted features with their respective categories are passed as input to build the classification model.
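As an illustration only, the sketch below builds an analogous token-based multinomial naive Bayes classifier with additive smoothing using scikit-learn; the toy documents and labels are placeholders, not the paper's training data.

```python
# Sketch: multinomial naive Bayes over token counts with additive smoothing,
# analogous to the token-based classifier described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["book hotel room travel deal",         # toy stand-ins for the
              "wedding dress bride ceremony plan"]   # preprocessed descriptions
train_labels = ["Travel: Hotels", "Society: Weddings"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)             # token count vectors
clf = MultinomialNB(alpha=1.0)                       # alpha = additive "prior count"
clf.fit(X, train_labels)

test = vectorizer.transform(["cheap hotel travel booking"])
print(clf.predict(test))                             # -> ['Travel: Hotels']
print(clf.predict_proba(test))                       # P(cat | tokens) per category
```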
TF-IDF:
This classifier is based on the relevance feedback algorithm originally proposed by Rocchio (Rocchio 1971) for the vector space retrieval model (Salton & McGill 1986). In the TF-IDF approach, the description of each app is treated as an input document which can be classified into many IAB categories; in other words, the TF-IDF classifier is adopted to find the best matching categories for a given app description. The TF-IDF approach thus captures the relevance relationships among words, text documents and particular categories. The TF-IDF weight of a word $w$ extracted in Step 1 is computed using the following formula:

$$\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \log\frac{|D|}{\mathrm{df}(w)},$$

where $|D|$ is the total number of documents in the corpus, $\mathrm{tf}(w, d)$ is the number of times word $w$ appears in a given document $d$, and $\mathrm{df}(w)$ is the number of documents in which $w$ occurs.
This word weighting heuristic says that a word $w$ is an important indexing term for document $d$ if it occurs frequently in it (i.e. the term frequency is high). On the other hand, words which occur in many documents are rated as less important indexing terms due to their low inverse document frequency. Training the classifier is achieved by combining document vectors into a prototype vector $\vec{c}_j$ for each class $C_j$. First, the normalized document vectors of the app descriptions belonging to a class (i.e. the positive examples) as well as those of the app descriptions of the other classes (i.e. the negative examples) are summed up. The prototype vector is then calculated as a weighted difference of the two:

$$\vec{c}_j = \frac{\alpha}{|C_j|} \sum_{\vec{d} \in C_j} \frac{\vec{d}}{\|\vec{d}\|} \;-\; \frac{\beta}{|D \setminus C_j|} \sum_{\vec{d} \in D \setminus C_j} \frac{\vec{d}}{\|\vec{d}\|} \qquad (1)$$

$\alpha$ and $\beta$ are parameters that adjust the relative impact of positive and negative training examples, $C_j$ is the set of training documents assigned to class $j$, and $\|\vec{d}\|$ denotes the Euclidean length of a vector $\vec{d}$.
The learned model for each class is represented by the resulting set of prototype vectors (see equation (1)). This model can be used to classify a new document $d'$. The new document is again represented as a vector $\vec{d'}$ using the scheme described above. To classify $d'$, the cosines of the prototype vectors $\vec{c}_j$ with $\vec{d'}$ are calculated, and the class with the highest cosine score is assigned to the document:

$$H_{\mathrm{TFIDF}}(d') = \arg\max_{C_j} \cos\left(\vec{c}_j, \vec{d'}\right)$$

In this way, the TF-IDF classifier was trained using the training data of 9205 apps. The trained model is then used to predict the classes of the remaining apps.
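A sketch of a Rocchio-style prototype-vector classifier following equation (1) is given below. The values of α and β and the toy documents are assumptions, and scikit-learn's TfidfVectorizer stands in for the term weighting described above.

```python
# Sketch of a Rocchio-style TF-IDF classifier (equation (1)): build one
# prototype vector per class from positive/negative examples, then assign
# new documents to the class whose prototype has the highest cosine score.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs   = ["book hotel room travel",            # toy preprocessed descriptions
          "flight hotel holiday deal",
          "wedding bride dress ceremony",
          "bride fashion beauty style"]
labels = ["Travel: Hotels", "Travel: Hotels",
          "Society: Weddings", "Society: Weddings"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
X = X / np.linalg.norm(X, axis=1, keepdims=True)       # normalised document vectors

alpha, beta = 16.0, 4.0                                 # assumed Rocchio weights
prototypes = {}
for c in set(labels):
    pos = X[[i for i, l in enumerate(labels) if l == c]]
    neg = X[[i for i, l in enumerate(labels) if l != c]]
    prototypes[c] = alpha / len(pos) * pos.sum(axis=0) - beta / len(neg) * neg.sum(axis=0)

def classify(text):
    d = vec.transform([text]).toarray()[0]
    norm = np.linalg.norm(d)
    d = d / norm if norm else d
    scores = {c: float(np.dot(p, d) / np.linalg.norm(p)) for c, p in prototypes.items()}
    return max(scores, key=scores.get)                  # class with highest cosine

print(classify("cheap hotel travel booking"))           # -> 'Travel: Hotels'
```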
Support Vector Machine:
Support Vector Machine (SVM) is a supervised
learning algorithm developed over the past decade
by Vapnik and others (Joachims, 1998; Vapnik,
1999). The algorithm addresses the general problem
of learning to discriminate between positive and
negative members of a given class of n-dimensional
vectors. The SVM algorithm operates by mapping
the given training set into a possibly high-
dimensional feature space and attempting to locate
in that space a plane that separates the positive from
the negative examples. SVM Multiclass library
(Joachims, 2008) has been used to train the SVM
classifier which uses the multi-class formulation
described in (Crammer and Singer, 2002). This formulation is optimized with an algorithm that is highly scalable in the linear case. SVM Multiclass
library expects the training and testing data in the
following format.
<line> .=. <target> <feature>:<value> <feature>:<value>
... <feature>:<value> # <info>
<target> .=. <integer>
<feature> .=. <integer>
<value> .=. <float>
<info> .=. <string>
Here the target and each feature must be represented by an integer. Thus all 371 categories were given a unique identifier from 1 to 371, and each unique feature was assigned a unique number across the training and testing data. The target value and each of the feature/value pairs are separated by a space character. Feature/value pairs are ordered by increasing feature number. Features with value zero are skipped when building the model. The target value denotes the class of the example via a positive (non-zero) integer. So, for example, the line

6 1:0.42 3:0.34 9284:0.2 # angry birds

specifies an example of class 6 (a games category) for which feature number 1 has the value 0.42, feature number 3 has the value 0.34, feature number 9284 has the value 0.2, and all other features have value 0. In addition, the app name "angry birds" is stored with the vector, which can serve as a way of providing additional information when adding user defined kernels. All features are represented by their respective tf-idf values for each category. As mentioned above, all three classifiers are trained using the same training data with different representations.
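A small sketch of serialising training examples into the SVM Multiclass input format quoted above; the category-to-integer mapping and feature values are illustrative.

```python
# Sketch: write training examples in the SVM Multiclass file format
# "<target> <feature>:<value> ... # <info>", with categories mapped to
# integers 1-371 and features to increasing integer ids.

def write_svm_multiclass(examples, category_ids, path="train.dat"):
    """examples: list of (category_name, {feature_id: tfidf_value}, app_name)."""
    with open(path, "w", encoding="utf-8") as fh:
        for category, features, app_name in examples:
            target = category_ids[category]                    # 1..371
            pairs = " ".join(f"{fid}:{val}"                    # increasing feature number,
                             for fid, val in sorted(features.items())
                             if val != 0)                      # zero-valued features skipped
            fh.write(f"{target} {pairs} # {app_name}\n")

category_ids = {"Games": 6}                                    # illustrative mapping
examples = [("Games", {1: 0.42, 3: 0.34, 9284: 0.2}, "angry birds")]
write_svm_multiclass(examples, category_ids)
# produces the line: 6 1:0.42 3:0.34 9284:0.2 # angry birds
```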
"NoMoreSprayandPrayAudienceTargetinginMobileWorld"-IABbasedClassificationApproachforMobileApp
AudienceMeasurement
9
The following section describes the audience measurement process using the output of the classification process.
5.2.3 Audience Measurement
Once an app has been classified using the previously described approach, demographics are assigned to it as follows. Assume an app A is classified into a set of categories $c_1, c_2, \ldots, c_k$ with respective classification scores $s_1, s_2, \ldots, s_k$. Their respective weighted average scores $w_1, w_2, \ldots, w_k$ are then calculated. These weighted average scores are required because the chosen classifiers return score values in different ranges, and more importance should be given to the category which returned the highest score. Assuming $s_1 \geq s_2 \geq \cdots \geq s_k$, the weights $w_i$ can be calculated using the Proportional Fuzzy Linguistic Quantifier (PFLQ) technique proposed by Yager (1988) as follows:

$$w_i = Q\left(\frac{i}{k}\right) - Q\left(\frac{i-1}{k}\right), \qquad i = 1, \ldots, k,$$

where $Q$ is the proportional fuzzy linguistic quantifier used to generate the ordered weights. After calculating the weighted average scores for each category, the overall demographics of app A are estimated as

$$D_A = \sum_{i=1}^{k} w_i \cdot D_{c_i},$$

where $D_{c_i}$ is the consolidated demographics of category $c_i$; each $D_{c_i}$ is a $1 \times m$ vector and $m$ is the number of different demographic dimensions. Using this approach, the "Age", "Education" and "Has Children" demographics are calculated.
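The weighting and combination step above can be sketched as follows. This is an illustration only: the quantifier Q(r) = r^0.5, the membership scores and the demographic shares for the second category are assumptions, not values from the paper.

```python
# Sketch of the audience-measurement step: order the classification scores,
# derive quantifier-guided weights w_i = Q(i/k) - Q((i-1)/k) (Yager 1988),
# and combine the category demographic vectors.

def quantifier_weights(k, Q=lambda r: r ** 0.5):
    # Q is an assumed proportional fuzzy linguistic quantifier.
    return [Q(i / k) - Q((i - 1) / k) for i in range(1, k + 1)]

def estimate_demographics(scored_categories, category_demographics):
    """scored_categories: [(category, score)]; higher scores get larger weights."""
    cats = [c for c, _ in sorted(scored_categories, key=lambda x: -x[1])]
    weights = quantifier_weights(len(cats))
    combined = {}
    for w, cat in zip(weights, cats):
        for label, share in category_demographics[cat].items():
            combined[label] = combined.get(label, 0.0) + w * share
    return combined

# "Society: Weddings" values follow Table 3; the second category's values are made up.
demo = {"Society: Weddings":        {"GenY": 0.75, "Middle Age": 0.10},
        "Style & Fashion: Beauty":  {"GenY": 0.60, "Middle Age": 0.25}}
scores = [("Society: Weddings", 0.4), ("Style & Fashion: Beauty", 0.2)]
print(estimate_demographics(scores, demo))   # weight-averaged age distribution
```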

5.2.4 Gender Prediction
Estimating the gender distribution of mobile app users using the above mentioned classification approach did not yield accuracy comparable to that achieved for the other attributes ("Age", "Education" and "Has Kids"). It has been observed that the IAB Tier-2 categories cannot be used to estimate the gender distribution of an app which belongs to more than one category. Thus, text mining and machine learning based approaches have been employed to predict the gender distribution of mobile apps. For this purpose, 9185 apps were manually and independently labeled for their gender distribution by two professional lexicographers. In this process, the lexicographers were instructed to label the gender distribution on a scale of 1-7. The meanings of these labels are shown in Table 5. The description of each app was given as the source for judging its gender distribution. For example, the Android games app "Blackjack Vegas" (MobileMediaCom 2014) is likely to be played more by male users than by female users; thus it was labeled "1" by the professional lexicographers.
Table 5: Gender Distribution Labels.

Label ID   Meaning
1          80% Male & 20% Female
2          70% Male & 30% Female
3          60% Male & 40% Female
4          50% Male & 50% Female
5          40% Male & 60% Female
6          30% Male & 70% Female
7          20% Male & 80% Female
To assess the reliability and validity of the rating,
inter-judge raw agreement and Hit ratio were
calculated. Inter-judge raw agreement was
calculated by counting the number of items both
judges labeled the same, divided by the total number
of items (Moore & Benbasat, 1991). Hit ratio is the
“overall frequency with which judges place items
within intended labels” (Moore & Benbasat, 1991).
Results show that there are no major concerns with
the labeling validity and reliability of these labels.
Inter-rater raw agreement score, which averaged
0.89, exceeds the acceptable levels of 0.65 (Moore
& Benbasat, 1991). The overall hit ratio of items
was 0.90.
Once the lexicographers finished labeling all 9185 apps, the data set was divided into two sets for training and testing: 6170 apps were used for training and 3015 for testing. The training instances contain fairly equal numbers of apps for each label (i.e. 1-7); for example, labels 1 and 2 were allocated 337 and 391 apps respectively. This balance reduces the risk of the model overfitting to any single label during training.
Further, using the 9185 apps, a corpus was built with the respective tf-idf score of each token. When building the corpus, the app descriptions were subjected to the same preprocessing mechanism described in Step 2 of the App Classification subsection. The corpus was created using the Apache Lucene indexer. Once the corpus was created, the training model was built using the apps identified for training. The step-by-step procedure of this approach is detailed below. In this approach, different feature selection methodologies, such as Information Gain, Chi-Square, Top-15 bigrams & unigrams and Top-10 unigrams, are used and their accuracy is evaluated. The steps taken using the Top-10 unigram approach are detailed below.
1. Each app description is fetched and subjected to preprocessing as discussed earlier (stemming, lemmatization and stop word removal).
2. For each app, the top 10 descriptive tokens are identified using their tf-idf scores, and the master feature set is built from them. Altogether, 20184 features were identified for the master feature set using the training data set apps.
3. Each app is then represented by its top 10 features, with the respective tf-idf scores used as the numerical representation values.
4. Each app's gender label (1-7) is used as the class variable, and all the remaining features are used as predictor variables.
5. Support Vector Machine regression (Joachims 1998) with a Gaussian Radial Basis Function (RBF) kernel is used to learn the patterns for predicting gender. For this purpose, the statistical tool R with the "e1071" package has been adopted, and the training model is built.
6. The test data set apps are subjected to the same preprocessing and numerical vector transformation procedure described for the training data set apps.
7. The test data set apps are then fed into R with the trained model, and the relevant gender distribution is predicted (see the sketch below).
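The paper implements steps 5-7 with R's e1071 package; the following is an equivalent hedged sketch in Python with scikit-learn's RBF-kernel support vector regression, using toy descriptions and labels rather than the actual labeled corpus.

```python
# Equivalent sketch (paper uses R/e1071): TF-IDF features, then an RBF-kernel
# support vector regression over the 1-7 gender labels of Table 5.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVR

train_texts  = ["blackjack vegas casino chips bet",      # label 1 (mostly male)
                "bride wedding dress planner beauty"]    # label 7 (mostly female), assumed
train_labels = np.array([1, 7])

vec = TfidfVectorizer(max_features=20184)    # cap mirrors the master feature set size
X_train = vec.fit_transform(train_texts)

model = SVR(kernel="rbf")                    # RBF-kernel SVM regression
model.fit(X_train, train_labels)

X_test = vec.transform(["poker casino slot bet"])
pred = float(model.predict(X_test)[0])       # continuous value, clipped and rounded to 1-7
print(round(min(max(pred, 1), 7)))
```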
The experimental results of the proposed solutions are discussed in the following section.
5.3 Preliminary Experiment
Having estimated the audience for each app using the hybrid methodology, the efficacy of the proposed solution is analyzed in three steps. First, the accuracy of the different classifiers used for predicting the relevant categories of an app is analyzed. Second, the accuracy of audience measurement using the classification approach ("Age", "Education" and "Has Children") is analyzed. Finally, the efficacy of gender prediction is analyzed. Details of the experimental procedures are described below.
To measure the accuracy of the different classification approaches, a test data set was built using 372 randomly chosen apps from popular categories such as Business, Entertainment, Education, Finance, Games and Style & Fashion. For the identified 372 apps, the input data was built using the feature extraction procedure described above and then subjected to the different classification approaches discussed above (Naïve Bayes, TF-IDF and SVM). Based on the classification, the top 5 predicted classes for each app were chosen, and the classes were validated for appropriateness by three professional lexicographers. To assess the reliability and validity of the rating, inter-judge raw agreement and hit ratio were calculated. The inter-rater raw agreement score, which averaged 0.73, exceeds the acceptable level of 0.65 (Moore & Benbasat, 1991). The overall hit ratio of items was 0.82.
Table 6 illustrates the accuracy of the different classifiers across categories. Overall, TF-IDF achieved the highest accuracy of 78% compared to the other two classifiers. Thus, the TF-IDF classifier was chosen for estimating the audience.
After validating the accuracy of the classifiers, the accuracy of app audience estimation, specifically for the Age, Education and Has Children attributes, is evaluated. For this purpose, the same test data (i.e. the 372 randomly chosen apps) used in measuring the accuracy of app classification was used, and the following steps were carried out. First, professional lexicographers were employed to manually estimate the audience of the given apps using the relevant app store URL (e.g. https://itunes.apple.com/us/app/abc-sight-words-writing-free/id379874412?mt=8). Then the automated audience estimation process was carried out. The efficacy of the audience estimation was evaluated by comparing the automatically estimated audience (demographic) values with the manually assigned demographic values using the well-known root-mean-square error (RMSE) metric. For the N (372 randomly chosen) apps, if $\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_N$ are the demographic values estimated using the proposed approach and $y_1, y_2, \ldots, y_N$ are the demographic values manually assigned by the professional lexicographers, then the RMSE is computed as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2}$$
In this way, demographic dimensions such as “Age”,
“Education” and “Has Kids” achieved 85.5%,
80.9%, and 80.07% accuracies respectively.
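A minimal sketch of the RMSE comparison is shown below; the estimated and manually assigned values are placeholders, not data from the study.

```python
# Minimal sketch of the RMSE check between automatically estimated and
# manually assigned demographic values (values below are placeholders).
import math

def rmse(estimated, manual):
    assert len(estimated) == len(manual)
    return math.sqrt(sum((e - m) ** 2 for e, m in zip(estimated, manual)) / len(estimated))

estimated_age = [0.70, 0.55, 0.80]   # e.g. estimated share of "GenY" users per app
manual_age    = [0.75, 0.50, 0.85]   # lexicographer-assigned shares
print(rmse(estimated_age, manual_age))   # -> 0.05
```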
As the third step, the efficacy of the proposed gender prediction mechanism is evaluated. For this purpose, 3015 apps and their respective descriptions are used as the source for building the test data. All the app descriptions were subjected to the same steps as the training data set apps (stemming, lemmatization and stop word removal). After this step, different feature selection approaches such as Information Gain, Chi-Square, Top-15 bigram and unigram tokens, and Top-10 unigram tokens are used.
Table 7 shows the number of features chosen under each feature selection mechanism together with the resulting precision, recall and overall accuracy values. It can be observed that the Top-10 unigrams produce the highest accuracy for predicting the gender of mobile applications. It has thus been observed that prediction accuracy increases when the feature matrix is large; in this case it is 3015 × 20647.
6 EXPECTED OUTCOME
In this study, it has been identified that important constituents of the app ecosystem face numerous hurdles in estimating the right audience for mobile apps. To solve this problem, a dynamic approach has been proposed which can effectively measure the audience demographics of the millions of existing apps as well as newly arriving apps. Experimental results of the approach are satisfactory. This study has several important implications. First, by using this audience estimation method, both mobile advertisers and app developers can benefit greatly by precisely targeting consumers. Since most ad requests do not contain relevant audience information, this approach can be used as a third-party platform to attach this data to ad requests. Second, app platform owners (e.g. Apple and Google) can use both the classification and audience measurement methods to effectively classify and estimate the audience for millions of existing apps and incoming new apps, and hence reach more consumers. Finally, this audience estimation can also help mobile app users in identifying the most suitable apps to fulfill their needs and wants. Given the popularity and usefulness of mobile apps, studies of this nature can greatly help many constituents of the app ecosystem and have rich potential to extend e-business research. The next stage of this research has already been discussed in the "Stage of the Research" section of this paper.
Table 6: Classification Accuracy across Categories.

IAB Category          Multinomial Naïve Bayes   TF-IDF    SVM
Business              62.50%                    69.75%    71%
Style & Fashion       67.2%                     79.6%     68%
Arts & Entertainment  71.00%                    83.00%    72%
News                  68.14%                    74.28%    67%
Health and Fitness    77.50%                    93.75%    84.25%
Personal Finance      74.00%                    80.00%    82%
Sports                74.00%                    81.60%    78%
Education             68.00%                    74.75%    67%
Hobbies & Interests   76.00%                    73.00%    69.8%
Travel                74%                       70%       76.5%
Overall               71.234%                   77.97%    73.56%
Table 7: Accuracy across different feature selection methods.

Feature Selection Method     Total number of Features   Precision   Recall   Overall Accuracy
Information Gain             965                        0.715       0.64     85%
Chi-Square                   847                        0.69        0.58     84.23%
Top-15 bi-grams & Unigrams   11370                      0.81        0.72     87%
Top-10 Unigrams              20647                      0.84        0.76     88.73%
REFERENCES
Adar, E. A. R. C., 2007. User Profile Classification by
Web Usage Analysis.
Alexa, 2012. Alexa the Web Information Company.
Available at: http://www.alexa.com.
Ebbert, J., 2012. Define It - What Is Programmatic
Buying? Available at: http://www.adexchanger.com/
online-advertising/define-programmatic-buying/.
Farago, P., 2012. App Engagement: The Matrix Reloaded.
Flurry.
Hu, J. et al., 2007. Demographic prediction based on
user’s browsing behavior. In Proceedings of the 16th
international conference on World Wide Web. pp.
151–160.
IAB, 2011. Networks & Exchanges Quality Assurance
ICSOFT2014-DoctoralConsortium
12
Guidelines, Available at: http://www.iab.net/media/
file/IAB-NE-QA-Guidelines-v1.5-November-2011-
FINAL.pdf.
IAB, 2013. Programmatic and RTB.
Joachims, T., 1998. Text Categorization with Support
Vector Machines: Learning with Many Relevant
Features. In ECML ’98 Proceedings of the 10th
European Conference on Machine Learning. Springer-
Verlag London, UK, pp. 137–142.
Kabbur, S., Han, E. & Karypis, G., 2010. Content-based
methods for predicting web-site demographic
attributes. In Proceedings of IEEE 10th International
Conference on Data Mining (ICDM). pp. 863–868.
Mary, E. G., 2013a. The Screen Bowl: Mobile Apps Take
On TV. Flurry. Available at: http://www.flurry.com/
bid/93898/The-Screen-Bowl-Mobile-Apps-Take-On-
TV#.U6FSAPm1bz0 [Accessed May 20, 2014].
Mary, E. G., 2013b. There’s An App Audience for That,
But It’s Fragmented. Available at: http://
blog.flurry.com/bid/96368/There-s-An-App-Audience-
for-That-But-It-s-Fragmented.
Mayer-Schönberger, V., 1998. The Internet and Privacy
Legislation: Cookies for a Treat? Computer Law &
Security Review, 14(3), pp.166–174.
MobileMediaCom, 2014. Blackjack Vegas. Google Play.
Murray, D. & Durrell, K., 2000. Inferring demographic
attributes of anonymous internet users. In Web Usage
Analysis and User Profiling. Springer, pp. 7–20.
Quantcast, 2013. May Mobile OS Share North America.
Available at: https://www.quantcast.com.
Rocchio, J., 1971. Relevance Feedback in Information
Retrieval. In The SMART Retrieval System -
Experiments in Automatic Document Processing,
Prentice-Hall.
Salton, G. & McGill, M. J., 1986. Introduction to Modern
Information Retrieval, New York, NY, USA:
McGraw-Hill, Inc.
Simon, K., 2013. The Rise of the App & Mortar Economy.
Toutanova, K. et al., 2003. Feature-rich part-of-speech
tagging with a cyclic dependency network. In
Proceedings of the 2003 Conference of the North
American Chapter of the Association for
Computational Linguistics on Human Language
Technology - Volume 1. Association for
Computational Linguistics, pp. 173–180.
Yager, R. R., 1988. On ordered weighted averaging
aggregation operators in multicriteria decisionmaking.
Systems, Man and Cybernetics, IEEE Transactions on,
18, pp.183–190.
Zhang, B. et al., 2006. Predicting demographic attributes
based on online behavior.
"NoMoreSprayandPrayAudienceTargetinginMobileWorld"-IABbasedClassificationApproachforMobileApp
AudienceMeasurement
13