Step 2: In this step, the processed description was subjected to part-of-speech tagging and lemmatization. The Stanford part-of-speech tagger (Toutanova et al., 2003) is used to attach a part-of-speech tag to each token (i.e. word) in the app description. More precisely, the app description is parsed into sentences, which are then processed by the part-of-speech tagger. When supplied with a sentence, the tagger produces an ordered list of part-of-speech tags, one for each word in the sentence (noun, verb, adjective, etc.). For example, the description of the app “Beer Calculator” contains the following sentence: “By now we all know that alcohol is bad for you, yet most of will still go out to have a beer”. When this sentence was subjected to the part-of-speech tagger, the word ‘By’ was tagged as a preposition, ‘now’ as an adverb, ‘we’ as a personal pronoun, ‘all’ as a determiner, and so on. The overall tagging result is: By/IN now/RB we/PRP all/DT know/VBP that/IN alcohol/NN is/VBZ bad/JJ for/IN you/PRP ,/, yet/RB most/JJS of/IN will/MD still/RB go/VB out/RP to/TO have/VB a/DT beer/NN, where IN, RB, PRP, DT, VBP, NN, VBZ, JJ and MD stand for preposition, adverb, personal pronoun, determiner, verb (non-3rd person singular present), noun, verb (3rd person singular present), adjective and modal, respectively. Once the descriptions were tagged, only the verbs, adverbs and nouns were extracted as the initial features. The extracted features were then subjected to lemmatization in order to obtain the root form (e.g. “running” is lemmatized to “run”) of each extracted token.
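As a rough illustration of this step (not the authors' implementation, which uses the Stanford tagger), the following sketch uses NLTK, whose default tagger emits the same Penn Treebank tag set; the function name extract_initial_features is hypothetical.

    # Sketch of Step 2: POS tagging, filtering to verbs/adverbs/nouns,
    # then lemmatization. NLTK stands in for the Stanford tagger here.
    import nltk
    from nltk.stem import WordNetLemmatizer

    # One-time downloads: 'punkt', 'averaged_perceptron_tagger', 'wordnet'.
    lemmatizer = WordNetLemmatizer()

    def extract_initial_features(description):
        features = []
        for sentence in nltk.sent_tokenize(description):
            for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
                # Keep only verbs (VB*), adverbs (RB*) and nouns (NN*).
                if tag.startswith(("VB", "RB", "NN")):
                    # Map the Penn Treebank tag to a WordNet POS code.
                    pos = {"V": "v", "R": "r", "N": "n"}[tag[0]]
                    features.append(lemmatizer.lemmatize(word.lower(), pos))
        return features

    print(extract_initial_features(
        "By now we all know that alcohol is bad for you, "
        "yet most of will still go out to have a beer"))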
Step 3: Once the initial set of features was extracted using the above-mentioned procedure, in step 3 it was checked against the master feature set. The master feature set is a bag of words containing words related to the app domain. The initial master feature set was created by lexicographers based on a bag of words (i.e. a dictionary) related to the app domain.
To build the master feature list, a corpus was created for each category by taking a sample of 100 apps per category; the tokens with high frequency and high idf (i.e. rare words across the corpus) were then identified for each category (the top 100 tokens) and added to the initial master feature list. If an extracted token appears in the master feature set, it is considered one of the features for the given app. Thus, for each app selected for training, the features were extracted and stored in a file in the following format:
“<feature_1> <feature_2> <feature_3> … <feature_n>”.
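The following sketch shows one plausible way to build the master feature list and filter app features against it. It is an assumption-laden illustration, not the paper's code: scoring tokens by term frequency times idf is our guess at what “high frequency and higher idf” means, and all names and the input format are hypothetical.

    # Hypothetical sketch of Step 3: build the master feature list from a
    # 100-app sample per category, then filter an app's extracted tokens.
    import math
    from collections import Counter

    def build_master_feature_set(samples_by_category, top_n=100):
        # samples_by_category: {category: [token list, one per sampled app]}
        doc_freq = Counter()
        n_docs = 0
        for apps in samples_by_category.values():
            for tokens in apps:
                n_docs += 1
                doc_freq.update(set(tokens))

        master = set()
        for category, apps in samples_by_category.items():
            tf = Counter(t for tokens in apps for t in tokens)
            # Rank tokens by frequency * idf within this category's corpus.
            score = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
            master.update(sorted(score, key=score.get, reverse=True)[:top_n])
        return master

    def app_features(extracted_tokens, master):
        # Keep only tokens that appear in the master feature set; these can
        # then be written as one space-separated line per app, matching the
        # file format shown above.
        return [t for t in extracted_tokens if t in master]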
Now that the features have been extracted, the next step is to build the classification model.
b. Building classification model
Multinomial Naïve Bayes, TF-IDF and Support Vector Machines are used as the initial approaches for classifying the apps into the possible IAB Tier-2 categories. A brief introduction to each of these methodologies is given below.
Naïve Bayes:
Since the training input is the pre-processed app description, a token-based Naïve Bayes classifier is used to compute the joint probability of the tokens in an app description and a category, by factoring the joint into the marginal probability of the category times the conditional probability of the tokens given the category, defined as follows:
$p(\text{tokens}, \text{cat}) = p(\text{cat}) \times p(\text{tokens} \mid \text{cat})$
The conditional probability of a category given the tokens is derived by applying Bayes's rule to invert the probability calculation:
$p(\text{cat} \mid \text{tokens}) = p(\text{tokens}, \text{cat}) \,/\, p(\text{tokens}) = p(\text{cat}) \times p(\text{tokens} \mid \text{cat}) \,/\, p(\text{tokens})$
Since Naïve Bayes assumes that the tokens are independent of each other given the category (this is the “naïve” step):
$p(\text{tokens} \mid \text{cat}) = p(\text{token}_1 \mid \text{cat}) \times \cdots \times p(\text{token}_n \mid \text{cat}) = \prod_{i=1}^{n} p(\text{token}_i \mid \text{cat})$
Then, using marginalization, the marginal distribution of the tokens is computed as follows:
$p(\text{tokens}) = \sum_{\text{cat}'} p(\text{tokens}, \text{cat}') = \sum_{\text{cat}'} p(\text{tokens} \mid \text{cat}') \times p(\text{cat}')$
In addition, the maximum a posteriori (MAP) estimate of the multinomial distribution $p(\text{cat})$ over the set of categories is calculated, and, for each category, the MAP estimate of the multinomial distribution $p(\text{token} \mid \text{cat})$ over the set of tokens.
Further, the Dirichlet conjugate prior for multinomials is employed, which is straightforward to compute by adding a fixed “prior count” to each count in the training data; this lends the technique its traditional name, “additive smoothing”. After building the Naïve Bayes classifier, the extracted features with their respective categories are passed as the input to build the classification model.
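To make the equations above concrete, here is a minimal from-scratch sketch of a token-based multinomial Naïve Bayes classifier with additive smoothing. It illustrates the formulas rather than the authors' actual implementation; the class name, prior count value and example categories are hypothetical.

    # Minimal token-based multinomial Naive Bayes with additive smoothing.
    import math
    from collections import Counter, defaultdict

    class TokenNaiveBayes:
        def __init__(self, prior_count=0.5):
            # Fixed "prior count" added to every token count (additive
            # smoothing, i.e. a Dirichlet conjugate prior on the multinomials).
            self.prior_count = prior_count
            self.cat_docs = Counter()                 # training apps per category
            self.token_counts = defaultdict(Counter)  # token counts per category
            self.vocab = set()

        def train(self, features, cat):
            # features: the tokens extracted for one app in Steps 1-3
            self.cat_docs[cat] += 1
            self.token_counts[cat].update(features)
            self.vocab.update(features)

        def classify(self, features):
            n_docs = sum(self.cat_docs.values())
            scores = {}
            for cat in self.cat_docs:
                # log p(cat), estimated from the category document counts
                score = math.log(self.cat_docs[cat] / n_docs)
                total = sum(self.token_counts[cat].values())
                denom = total + self.prior_count * len(self.vocab)
                for t in features:
                    # smoothed log p(token_i | cat)
                    score += math.log(
                        (self.token_counts[cat][t] + self.prior_count) / denom)
                scores[cat] = score
            # argmax over cat of log p(cat) + sum_i log p(token_i | cat)
            return max(scores, key=scores.get)

    nb = TokenNaiveBayes()
    nb.train(["beer", "alcohol", "calculator"], "Food & Drink")
    nb.train(["run", "track", "distance"], "Sports")
    print(nb.classify(["beer", "alcohol"]))   # -> "Food & Drink"

Note that $p(\text{tokens})$ can be ignored at classification time, since it is constant across categories; the sketch therefore compares only the unnormalized log scores.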
TF-IDF:
This classifier is based on the relevance feedback algorithm originally proposed by Rocchio (Rocchio, 1971) for the vector space retrieval model (Salton & McGill, 1986). In TF-IDF we considered the app