competitive, both in terms of feature extraction and
computational complexity, for review sentence classi-
fication.
Recently, deep learning based models have gained
popularity among researchers because they can learn
useful feature representations automatically from a
large corpus of labeled data, without manual feature
engineering effort.
Specifically, a deep learning model known as the
Convolutional Neural Network (CNN) has recently
achieved encouraging results on various text
classification tasks (Kim, 2014). A recent study by Fu and
Menzies (2017) suggests that researchers should always
compare computationally expensive models with their
simple and efficient counterparts. To this end,
we are interested in comparing the powerful deep
learning CNN model with the simple BoW model.
We formulate the second research question (RQ2) as
follows:
RQ2: How does the deep learning based CNN
classifier compare with the simple BoW model for
app review sentence classification?
To answer RQ2, we experiment with CNN-based
models for review sentence classification, adopting
the model proposed by Kim (2014). Comparing the
CNN model with the MaxEnt model using BoW features
shows that, on average, the CNN-based model performs
slightly worse than the BoW model. However, for the
review sentence types feature request and bug report,
which are among the most informative sentence types
for developers, CNN-based models obtain the highest
precision.
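
As a rough illustration of the kind of feature such a CNN learns, the sketch below computes one convolutional feature map over a sentence's word vectors and applies max-over-time pooling, the core operation of the architecture in Kim (2014). The function names, the toy vectors, and the choice of ReLU as the nonlinearity are assumptions made for illustration; this is not the configuration used in the experiments.

```python
def conv_max_pool(embeddings, filt, bias=0.0):
    """Apply one filter over all word windows, then max-over-time pooling.

    embeddings: list of word vectors (one per token), each of length d.
    filt: list of h weight vectors of length d (a filter spanning h words).
    Returns a single scalar feature for the sentence.
    """
    h = len(filt)
    activations = []
    for start in range(len(embeddings) - h + 1):
        window = embeddings[start:start + h]
        total = bias + sum(
            w * x
            for w_row, x_row in zip(filt, window)
            for w, x in zip(w_row, x_row)
        )
        activations.append(max(0.0, total))  # ReLU (assumed nonlinearity)
    # Sentences shorter than the filter width yield no window at all.
    return max(activations) if activations else 0.0


def sentence_features(embeddings, filters):
    """One pooled feature per filter; filters may have different widths."""
    return [conv_max_pool(embeddings, f) for f in filters]


# Hypothetical 2-dimensional word vectors for a three-token sentence.
toy_sentence = [[1, 0], [0, 1], [1, 1]]
toy_filters = [
    [[1, 0], [0, 1]],  # filter spanning two words
    [[-1, -1]],        # filter spanning one word
]
features = sentence_features(toy_sentence, toy_filters)
```

In a full model, the pooled features from many filters would be concatenated and passed to a softmax output layer, and both the word vectors and the filter weights would be learned from the labeled sentences.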
The rest of the paper is structured as follows.
Section 2 summarizes the related work. In Section 3,
we describe the dataset used for this study. In
Section 4, we provide the description of the features
and models used in this study.
Section 5 details the experimental setting.
Section 6 discusses the results. In Section 7, threats
to validity are examined. Conclusions are presented
in Section 8.
2 RELATED WORK
Gu and Kim (2015) proposed the system “SUR-Miner”,
which classifies review sentences into feature
evaluation, praise, bug report, feature request, and other.
They used a MaxEnt model for the classification task
with a rich set of lexical and structural features
extracted with NLP tools. We adopt their feature set and
compare it to the BoW model. However, our results
are not directly comparable to theirs because they
trained a separate model for each app, while we train a
single model on the sentences of all apps in
the dataset and thus have a larger training set.

Table 1: Definition of five review sentence types used in
the study of Gu and Kim (2015)

Sentence type       Definition                                             Examples
Praise              Expressing emotions without specific reasons           Excellent!
                                                                           I love it!
                                                                           Amazing!
Feature Evaluation  Expressing opinions about specific features            The UI is convenient.
                                                                           I like the prediction text.
Bug Report          Reporting bugs, glitches or problems                   It always force closes when I click the ".com" button.
Feature Request     Suggestion or new feature requests                     It's a pity it doesn't support Chinese.
Other               Other categories defined in Pagano and Maalej (2013)   I've been playing it for three years.
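
For readers unfamiliar with the MaxEnt model mentioned above, it is equivalent to multinomial logistic regression: each class receives a linear score over the feature vector, and the scores are normalized with a softmax. The sketch below shows only this prediction step; the function name and the toy weights are hypothetical, and training the weights is not shown.

```python
import math


def maxent_probabilities(features, weights, biases):
    """Class probabilities of a MaxEnt (multinomial logistic regression) model.

    features: feature vector of one sentence (e.g. BoW counts).
    weights:  one weight vector per class, same length as features.
    biases:   one bias term per class.
    """
    scores = [
        b + sum(w * x for w, x in zip(w_row, features))
        for w_row, b in zip(weights, biases)
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Hypothetical two-class example with two features.
uniform = maxent_probabilities([1, 0], [[0, 0], [0, 0]], [0, 0])
skewed = maxent_probabilities([1, 0], [[2, 0], [0, 0]], [0, 0])
```

With all weights at zero the classes are equally likely, while a positive weight on an active feature shifts probability toward its class.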
Maalej et al. (2015) experimented with different
classification models to classify reviews into feature
request, bug report, rating, and user experience. They
experimented with various features, including BoW.
However, they evaluated their models at the review level
rather than at the sentence level, as we do in this work.
Similarly to our approach, they trained their models on
the whole dataset covering different apps.
Chen et al. (2014) proposed the system “AR-Miner”
to help developers identify informative reviews.
Their system classifies review sentences into
two classes: informative and non-informative.
The study of Panichella et al. (2015) assigned a
different set of categories to reviews based on user in-
tentions, i.e., opinion asking, problem discovery, so-
lution proposal, information seeking, and information
giving, and trained a learner to automatically classify
reviews into those categories.
All these previous studies have used manual feature
engineering for their classification models. To our
knowledge, this is the first study that also
experiments with features automatically learned with
a deep neural network to classify app reviews.
Moreover, none of the previous studies has established
the BoW baseline for review sentence classification.
BoW is one of the simplest feature sets: it does not
require any feature engineering or external tools, and
despite its simplicity it can be very effective.
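
To make concrete how little machinery BoW features require, the following minimal sketch builds a vocabulary from training sentences and maps a sentence to a vector of word counts. The lowercasing, the whitespace tokenization, and the function names are assumptions made for illustration and are not necessarily the preprocessing used in this study.

```python
from collections import Counter


def build_vocabulary(sentences):
    """Assign an index to every distinct token seen in the training sentences."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_vector(sentence, vocab):
    """Map a sentence to word counts; tokens outside the vocabulary are ignored."""
    counts = Counter(sentence.lower().split())
    vector = [0] * len(vocab)
    for token, count in counts.items():
        index = vocab.get(token)
        if index is not None:
            vector[index] = count
    return vector


training_sentences = ["It always force closes", "I love it"]
vocab = build_vocabulary(training_sentences)
example = bow_vector("I love it it", vocab)
```

Word order is discarded entirely, so each sentence is represented only by which vocabulary words it contains and how often.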
3 DATASET AND PREPROCESSING
For this study, we used the app review dataset contri-
buted by Gu and Kim (2015). The dataset contains
labeled review sentences of 17 apps belonging to dif-
Simple App Review Classification with Only Lexical Features