chosen representation of the posts. In particular, the information that the sequence of words can provide, that is, the memory capability of the GRU unit, was not used in our case.
We applied the batch normalization process (Ioffe and Szegedy, 2015) to reduce the shift in the distribution of the inputs to the internal layers of the deep neural network. This method has numerous benefits, such as stabilizing the network, accelerating the learning, and reducing the dependence of the gradients on the scale of the parameters.
We utilized a regularization technique called dropout to avoid overfitting (Srivastava et al., 2014). During the learning process, this technique randomly deactivates neurons with a probability given by a hyperparameter called dropout_rate.
We chose the ReLU (Rectified Linear Unit) activation function for the inner layers. This function helps to avoid the vanishing gradient problem associated with activation functions commonly applied in the inner layers of deep neural networks, such as the sigmoid and the hyperbolic tangent (Glorot et al., 2011). For the last layer, which performs the classification, we chose the softmax function. The first layer (GRU) uses the hyperbolic tangent as its activation function. Binary cross-entropy was selected as the loss function.
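A minimal Keras sketch of such a layer stack is shown below; the layer sizes, the dropout rate, and the input dimension are illustrative assumptions rather than the values used in our experiments.

```python
from tensorflow.keras import layers, models

EMBEDDING_DIM = 300   # assumed dimension of the document vectors
DROPOUT_RATE = 0.5    # assumed value of the dropout_rate hyperparameter

model = models.Sequential([
    # First layer: GRU with hyperbolic tangent activation. Each post is
    # a single document vector, so the sequence length is 1 and the
    # GRU's memory capability is not exploited (see above).
    layers.GRU(128, activation="tanh", input_shape=(1, EMBEDDING_DIM)),
    layers.BatchNormalization(),          # normalize internal inputs
    layers.Dropout(DROPOUT_RATE),         # randomly deactivate neurons
    layers.Dense(64, activation="relu"),  # inner layer with ReLU
    layers.BatchNormalization(),
    layers.Dropout(DROPOUT_RATE),
    layers.Dense(2, activation="softmax"),  # low- vs. high-quality
])
```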
For training the model, we applied the stochastic gradient descent method with Nesterov momentum (Sutskever et al., 2013). With this technique, the gradient of the loss function is evaluated not at the current position but a little ahead of it, in the direction defined by the hyperparameter momentum.
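Continuing the sketch above, the optimizer could be configured as follows; the learning rate and momentum values are assumptions, not the values used in the study.

```python
import tensorflow as tf

# Nesterov momentum, schematically:
#   v <- mu * v - lr * grad(L, w + mu * v)   # gradient at the look-ahead point
#   w <- w + v
optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.01,  # assumed value
    momentum=0.9,        # assumed value of the momentum hyperparameter
    nesterov=True,
)
model.compile(optimizer=optimizer,
              loss="binary_crossentropy",
              metrics=["accuracy"])
```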
For evaluation purposes, we applied three metrics: precision, recall, and accuracy. The metrics were calculated using the weighted-average method, which reflects the overall performance by weighting each class according to its number of samples.
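The weighted averaging can be computed, for instance, with scikit-learn; the label arrays below are placeholders.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0]  # placeholder ground-truth labels
y_pred = [1, 0, 0, 1, 0]  # placeholder model predictions

# Weighted average: each class's score is weighted by its support,
# i.e., the number of samples belonging to that class.
precision = precision_score(y_true, y_pred, average="weighted")
recall = recall_score(y_true, y_pred, average="weighted")
accuracy = accuracy_score(y_true, y_pred)
```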
For testing the model, we chose a random subset of the Stack Overflow questions posted up to mid-September 2013. This period was selected because we intended to compare the performance of our model with that of Ponzanelli et al. (Ponzanelli et al., 2014), who used the Stack Overflow data dump from September 2013. That dump is no longer available; therefore, we selected the corresponding period from the dump created on the 4th of March 2019. The test set was constructed to contain two disjoint classes, high-quality and low-quality questions, according to the criteria described in the previous section. Questions that were edited by their author after posting were excluded. Since the present study applies the same filtering conditions as those used earlier by Ponzanelli and co-workers, a direct comparison of our results with the previous ones is possible.
The test dataset was compiled using balanced subsets of high-quality and low-quality questions, with 110,547 posts in each subset. The final working test set was obtained by randomly sampling 30% of the full set (66,328 posts), similarly to the study of Ponzanelli and colleagues.
Our model was trained using the whole question dataset, without any time restriction, applying the same selection criteria for the two classes as for the test set. Questions occurring in the test set were excluded from the training set; hence, the model was evaluated on questions it had never seen before.
As our objective is to decide on the potential rejection of a question posted on Stack Overflow using only its linguistic features, the training and test sets contain only the question body, the title, and the tags attached to the given post. The quality label, 0 for low-quality and 1 for high-quality questions, along with the unique Id of the posts, was also included in the training set. The Id was used only for post identification during preprocessing and was removed before the actual training.
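A sketch of this preparation step is given below; the DataFrame and column names are hypothetical, as the paper only specifies which fields are kept.

```python
# `posts` is assumed to be a pandas DataFrame of the raw questions.
# Keep only the linguistic features, the 0/1 quality label, and the Id.
dataset = posts[["Id", "Title", "Body", "Tags", "Label"]]

# The Id identifies posts during preprocessing only;
# it is dropped before the actual training.
X = dataset.drop(columns=["Id", "Label"])
y = dataset["Label"]
```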
The whole process of our experiments is presented in Figure 4.
A cleaning procedure was first applied to the raw text of the question bodies, titles, and tags. In particular, the code blocks, HTML tags, and non-textual characters were removed. Tokenization was performed using the nlp method available in the language tool spaCy (https://spacy.io/).
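The cleaning and tokenization step could be sketched as follows; the use of BeautifulSoup and the particular regular expression are assumptions, since the paper only states which elements are removed.

```python
import re
from bs4 import BeautifulSoup
import spacy

nlp = spacy.load("en_core_web_md")

def clean(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    for code in soup.find_all("code"):  # remove code blocks
        code.decompose()
    text = soup.get_text(" ")           # strip the remaining HTML tags
    # Drop non-textual characters (assumed definition).
    return re.sub(r"[^A-Za-z0-9\s.,;:!?'-]", " ", text)

doc = nlp(clean("<p>How do I read a <code>file</code> in Python?</p>"))
tokens = [token.text for token in doc]
```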
The cleaned dataset was then transformed into document vectors obtained from spaCy using the language model en_core_web_md, which is a general linguistic model trained on text from blogs, news, and comments. The model provides word vectors and document vectors based on the Word2Vec representation (Mikolov et al., 2013). The drawback of this approach is that spaCy computes the document vector as the mean of the word vectors of the given document, which in our case is the post itself, including its body, title, and tags. The dimension of the vectors in this representation is 300.
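This averaging behaviour can be observed directly (a minimal sketch):

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("How do I sort a dictionary by value?")

# spaCy's document vector is the mean of the token vectors,
# so it has the same dimension as the word vectors.
print(doc.vector.shape)  # (300,)
```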
Considering that many special words and technical phrases, e.g., built-in function names, occur in the posts that have no vector representation in spaCy's vocabulary, we constructed a 200-dimensional Doc2Vec model (Le and Mikolov, 2014) with the Gensim library (https://radimrehurek.com/gensim/apiref.html) using Stack Overflow questions.
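Training such a Doc2Vec model with Gensim might look as follows; the window size, minimum count, and epoch count are assumptions, and cleaned_posts and post_ids are hypothetical variables holding the tokenized questions and their Ids.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One TaggedDocument per question, tagged with its post Id.
documents = [TaggedDocument(words=tokens, tags=[post_id])
             for tokens, post_id in zip(cleaned_posts, post_ids)]

model = Doc2Vec(documents, vector_size=200, window=5,
                min_count=2, epochs=20, workers=4)

# Infer a 200-dimensional vector for an unseen post.
vector = model.infer_vector(["how", "to", "sort", "a", "list"])
print(vector.shape)  # (200,)
```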
This model is based on the Word2Vec method