6 DISCUSSION
The above experiments present a way to recognize
word sequences as candidate concepts for key-phrase
extraction.
For the application of concept extraction to automatic tagging of documents, we are interested in high precision because false positives decrease user acceptance of the system. In terms of recall and F1, the CNN solutions performed well on the classification task, but none of the tested configurations achieved precision scores as high as their recall. This is a disadvantage, since avoiding false positives is essential for user acceptance of automatic tagging.
In the training phase, the networks reached a precision between 80% and 90%. However, integrating the CNN concept extraction into the initial prototype and thus applying it to a different validation dataset showed a precision of only 60% to 80% on average, indicating that the CNN overfitted during training.
Nevertheless, combined with a TF-IDF based relevance ranking, the top five n-gram concepts calculated for Wikipedia articles showed a higher precision of up to 94%. This means that, on average, out of four documents with five automatically extracted top keywords each, only one document contained an n-gram that is not a Wikipedia entry title.
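As an illustration, a minimal sketch of such a TF-IDF relevance ranking with a top-five cutoff is given below; the vectorizer settings, the maximum n-gram length, and the function name are assumptions for illustration and do not reproduce the exact pipeline used in the experiments.

    # Sketch: rank n-gram candidates per document by TF-IDF and keep the top five.
    from sklearn.feature_extraction.text import TfidfVectorizer

    def top_five_ngrams(documents, max_n=4):
        """Return the five highest-scoring n-gram candidates for each document."""
        vectorizer = TfidfVectorizer(ngram_range=(1, max_n), lowercase=True)
        tfidf = vectorizer.fit_transform(documents)   # matrix: documents x n-grams
        vocabulary = vectorizer.get_feature_names_out()
        top = []
        for row in tfidf:                             # one sparse row per document
            scores = row.toarray().ravel()
            best = scores.argsort()[::-1][:5]         # indices of the five largest scores
            top.append([vocabulary[i] for i in best if scores[i] > 0])
        return top

The resulting candidates would then be checked against Wikipedia entry titles, as described above.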
Yet, the best performing POS concept filtering combined with TF-IDF relevance ranking and a top-five cutoff achieved an even higher precision of 98%. However, this precision decreased drastically with an increasing number of words in the n-grams, as the POS approach filtered out many n-grams with larger n. The CNN-based approach recognized many more n-gram concepts, which can be seen in the recall curves in Table 6 by comparing the solid blue line (CNN recall) with the dotted blue line (POS recall).
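For comparison, a POS-based candidate filter typically accepts only n-grams whose tag sequence forms a simple noun phrase. The sketch below is an assumption about such a filter (pattern, tagger, and function name are illustrative, not the exact rule set evaluated here); it also hints at why longer n-grams are filtered out more aggressively, since long tag sequences rarely match a strict pattern.

    # Illustrative POS filter: keep an n-gram only if its tags look like a noun phrase.
    # Requires the NLTK tagger resource, e.g. nltk.download('averaged_perceptron_tagger').
    import re
    import nltk

    NOUN_PHRASE = re.compile(r"^((JJ|NN\w*)\s)*NN\w*$")  # adjectives/nouns ending in a noun

    def is_pos_concept(ngram):
        tags = [tag for _, tag in nltk.pos_tag(ngram.split())]
        return bool(NOUN_PHRASE.match(" ".join(tags)))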
Using Wikipedia as the gold standard rests on the generally accepted assumption that each Wikipedia page represents a concept. The opposite, of course, is not true: if there is no Wikipedia entry for a phrase, it can still be a valid concept. Although we did not run analyses in this respect, Parameswaran et al. (2010) demonstrated through a crowd-sourced effort that valid concepts not existing as Wikipedia pages account for less than 3% of all n-grams in the false-negative (FN) category.
As seen, the networks had weaknesses regarding generality. They were not always able to perform on unseen data as well as they did on the validation set during training. Moreover, repeating the training of the same network revealed outliers. Too much dropout overall, or dropout in the wrong position, could be one source of this behavior: the networks may have been exposed to too much randomness and were therefore unable to learn the small but essential differences between the word vectors. Furthermore, some of the dropout could have been replaced by L1 and L2 regularization. This would polarize the connections by pushing the weights (Ng, 2004) towards a simpler network with either strong weights or no weights between neurons.
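A hedged sketch of how part of the dropout could be replaced by such weight penalties in a Keras convolution layer is shown below; the layer type, penalty strengths, and placement are assumptions for illustration, not the configuration used in our experiments.

    # Sketch: L1/L2 weight penalties on a convolution layer instead of (some) dropout.
    from tensorflow.keras import layers, regularizers

    conv = layers.Conv1D(
        filters=128,
        kernel_size=3,
        activation="relu",
        # L1 drives small weights to exactly zero, L2 shrinks large weights,
        # polarizing connections towards either strong or absent links.
        kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
    )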
There are several aspects to be considered for further research projects: a) Experimenting with different word features could increase the performance significantly. b) Instead of using a balanced list of concepts and non-concepts, the training data could be generated by going through the text corpus word by word (see the sketch after this list). The network would then be trained on n-grams in the sequence in which they appear in the text, so that frequent n-grams receive more weight. c) Changing the input representation and using a recurrent neural network instead of a CNN could improve the results. Compared to CNNs, RNNs do not require a fixed-size input and have been found to outperform them in some NLP tasks. d) One network could be trained per n-gram length, so that a single network does not need to take all the different distributions into account at once.
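The following sketch illustrates suggestion b); tokenization and the maximum n-gram length are simplifying assumptions.

    # Sketch for suggestion b): generate training n-grams by sliding over the corpus
    # word by word instead of sampling from a balanced concept/non-concept list.
    def sliding_ngrams(corpus_text, max_n=4):
        """Yield every n-gram (n = 1..max_n) in the order it appears in the text."""
        tokens = corpus_text.split()
        for i in range(len(tokens)):
            for n in range(1, max_n + 1):
                if i + n <= len(tokens):
                    yield " ".join(tokens[i:i + n])

    # Frequent n-grams are emitted more often and therefore receive more weight
    # during training.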
REFERENCES
Bengio, Y. (2012). Practical recommendations for gradient-
based training of deep architectures. CoRR, abs/
1206.5533.
Dalvi, N., Kumar, R., Pang, B., Ramakrishnan, R., Tom-
kins, A., Bohannon, P., Keerthi, S., and Merugu, S.
(2009). A web of concepts. In Proceedings of the
Twenty-eighth ACM SIGMOD-SIGACT-SIGART Sym-
posium on Principles of Database Systems, PODS
’09, pages 1–12, New York, NY, USA. ACM.
Das, B., Pal, S., Mondal, S. K., Dalui, D., and Shome, S. K.
(2013). Automatic keyword extraction from any text
document using n-gram rigid collocation. Int. J. Soft
Comput. Eng.(IJSCE), 3(2):238–242.
Fürnkranz, J. (1998). A study using n-gram features for text categorization. Austrian Research Institute for Artificial Intelligence, 3(1998):1–10.
Google (2013). Googlenews-vectors-negative300.bin.gz. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit. (Accessed on 01/15/2018).
Hughes, M., Li, I., Kotoulas, S., and Suzumura, T. (2017).
Medical text classification using convolutional neural
networks. arXiv preprint arXiv:1704.06841.
Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T.
(2016). Bag of tricks for efficient text classification.
arXiv preprint arXiv:1607.01759.