6 DISCUSSION
The above experiments present a way to recognize
word sequences as candidate concepts for key-phrase
extraction.
For the application of concept extraction to automatic tagging of documents, we are interested in high precision because false positives decrease user acceptance of the system. In terms of recall and F1, the CNN solutions performed well on the classification task, but none of the tested configurations achieved precision scores as high as their recall. This is a disadvantage, since avoiding false positives is essential for user acceptance of automatic tagging.
In the training phase, the networks reached a precision between 80% and 90%. However, integrating the CNN concept extraction into the initial prototype and thus applying it to a different validation dataset showed a precision of only 60% to 80% on average, indicating that the CNN overfitted during training.
Nevertheless, combined with a TF-IDF based relevance ranking, the top five n-gram concepts calculated for Wikipedia articles showed a higher precision of up to 94%. This means that, on average, out of four documents with five automatically extracted top keywords each, only one document contained an n-gram that is not a Wikipedia entry title.
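As an illustration, a minimal sketch of such a TF-IDF relevance ranking with a top-five cutoff is given below; the vectorizer settings, the maximum n-gram length, and the function name are assumptions for illustration and do not reproduce the exact pipeline used in the experiments.

    # Sketch: rank n-gram candidates per document by TF-IDF and keep the top five.
    from sklearn.feature_extraction.text import TfidfVectorizer

    def top_five_ngrams(documents, max_n=4):
        """Return the five highest-scoring n-gram candidates for each document."""
        vectorizer = TfidfVectorizer(ngram_range=(1, max_n), lowercase=True)
        tfidf = vectorizer.fit_transform(documents)   # matrix: documents x n-grams
        vocabulary = vectorizer.get_feature_names_out()
        top = []
        for row in tfidf:                             # one sparse row per document
            scores = row.toarray().ravel()
            best = scores.argsort()[::-1][:5]         # indices of the five largest scores
            top.append([vocabulary[i] for i in best if scores[i] > 0])
        return top

The resulting candidates would then be checked against Wikipedia entry titles, as described above.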
Yet, the best performing POS concept filtering combined with TF-IDF relevance ranking and a top-five cutoff achieved an even higher precision of 98%. However, this precision decreased drastically with an increasing number of words in the n-grams, as the POS approach filtered out many n-grams with larger n. The CNN-based approach recognized many more n-gram concepts, which can be seen in the recall curves in Table 6 by comparing the solid blue line (CNN recall) with the dotted blue line (POS recall).
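For comparison, a POS-based candidate filter typically accepts only n-grams whose tag sequence forms a simple noun phrase. The sketch below is an assumption about such a filter (pattern, tagger, and function name are illustrative, not the exact rule set evaluated here); it also hints at why longer n-grams are filtered out more aggressively, since long tag sequences rarely match a strict pattern.

    # Illustrative POS filter: keep an n-gram only if its tags look like a noun phrase.
    # Requires the NLTK tagger resource, e.g. nltk.download('averaged_perceptron_tagger').
    import re
    import nltk

    NOUN_PHRASE = re.compile(r"^((JJ|NN\w*)\s)*NN\w*$")  # adjectives/nouns ending in a noun

    def is_pos_concept(ngram):
        tags = [tag for _, tag in nltk.pos_tag(ngram.split())]
        return bool(NOUN_PHRASE.match(" ".join(tags)))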
Using Wikipedia as the gold standard rests on the generally accepted assumption that each Wikipedia page represents a concept. The opposite, of course, is not true: if there is no Wikipedia entry for a phrase, it can still be a valid concept. Although we did not run analyses in this respect, Parameswaran et al. (2010) demonstrated through a crowd-sourced effort that valid concepts not existing as Wikipedia pages account for less than 3% of all n-grams in the false-negative (FN) category.
As seen, the networks had weaknesses regarding generality. They were not always able to perform on unseen data as well as they did on the validation set during training. Moreover, repeating the training of the same network revealed outliers. Too much dropout overall, or dropout in the wrong position, could be one source of this behavior: the networks may have been exposed to too much randomness and were therefore unable to learn the small but essential differences between the word vectors. Furthermore, some of the dropout could have been replaced by L1 and L2 regularization. This would polarize the connections by pushing the weights (Ng, 2004) towards a simpler network with either strong weights or no weights between neurons.
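A hedged sketch of how part of the dropout could be replaced by such weight penalties in a Keras convolution layer is shown below; the layer type, penalty strengths, and placement are assumptions for illustration, not the configuration used in our experiments.

    # Sketch: L1/L2 weight penalties on a convolution layer instead of (some) dropout.
    from tensorflow.keras import layers, regularizers

    conv = layers.Conv1D(
        filters=128,
        kernel_size=3,
        activation="relu",
        # L1 drives small weights to exactly zero, L2 shrinks large weights,
        # polarizing connections towards either strong or absent links.
        kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
    )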
There are several aspects to be considered for further research projects: a) Experimenting with different word features could increase the performance significantly. b) Instead of using a balanced list of concepts and non-concepts, the training data could be generated by going through the text corpus word by word (see the sketch after this list). The network would then be trained on n-grams in the sequence in which they appear in the text, so that frequent n-grams receive more weight. c) Changing the input representation and using a recurrent neural network instead of a CNN could improve the results. Compared to CNNs, RNNs do not require a fixed-size input and have been found to outperform them in some NLP tasks. d) One network could be trained per n-gram length, so that a single network does not need to take all the different distributions into account at once.
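The following sketch illustrates suggestion b); tokenization and the maximum n-gram length are simplifying assumptions.

    # Sketch for suggestion b): generate training n-grams by sliding over the corpus
    # word by word instead of sampling from a balanced concept/non-concept list.
    def sliding_ngrams(corpus_text, max_n=4):
        """Yield every n-gram (n = 1..max_n) in the order it appears in the text."""
        tokens = corpus_text.split()
        for i in range(len(tokens)):
            for n in range(1, max_n + 1):
                if i + n <= len(tokens):
                    yield " ".join(tokens[i:i + n])

    # Frequent n-grams are emitted more often and therefore receive more weight
    # during training.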
REFERENCES
Bengio, Y. (2012). Practical recommendations for gradient-
based training of deep architectures. CoRR, abs/
1206.5533.
Dalvi, N., Kumar, R., Pang, B., Ramakrishnan, R., Tom-
kins, A., Bohannon, P., Keerthi, S., and Merugu, S.
(2009). A web of concepts. In Proceedings of the
Twenty-eighth ACM SIGMOD-SIGACT-SIGART Sym-
posium on Principles of Database Systems, PODS
’09, pages 1–12, New York, NY, USA. ACM.
Das, B., Pal, S., Mondal, S. K., Dalui, D., and Shome, S. K.
(2013). Automatic keyword extraction from any text
document using n-gram rigid collocation. Int. J. Soft
Comput. Eng.(IJSCE), 3(2):238–242.
Fürnkranz, J. (1998). A study using n-gram features for text categorization. Austrian Research Institute for Artificial Intelligence, 3(1998):1–10.
Google (2013). Googlenews-vectors-negative300.bin.gz. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit. (Accessed on 01/15/2018).
Hughes, M., Li, I., Kotoulas, S., and Suzumura, T. (2017).
Medical text classification using convolutional neural
networks. arXiv preprint arXiv:1704.06841.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T.
(2016). Bag of tricks for efficient text classification.
arXiv preprint arXiv:1607.01759.