Authors:
Andreas Waldis; Luca Mazzola and Michael Kaufmann
Affiliation:
Lucerne University of Applied Sciences, School of Information Technology, 6343 Rotkreuz, Switzerland
Keyword(s):
Natural Language Processing, Concept Extraction, Convolutional Neural Network.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Biomedical Engineering; Business Analytics; Data Engineering; Data Management and Quality; Data Mining; Databases and Information Systems Integration; Datamining; Enterprise Information Systems; Health Information Systems; Semi-Structured and Unstructured Data; Sensor Networks; Signal Processing; Soft Computing; Text Analytics
Abstract:
For knowledge management purposes, it would be interesting to classify and tag documents automatically based on their content. Concept extraction is one way of achieving this, using statistical or semantic methods. Whereas index-based keyphrase extraction can extract relevant concepts for documents, the inverse document index grows exponentially with the number of words that candidate concepts can have. To address this issue, the present work trains convolutional neural networks (CNNs) containing vertical and horizontal filters to learn, from a training set of labeled examples, whether an N-gram (i.e., a consecutive sequence of N characters or words) is a concept or not. The classification training signal is derived from the Wikipedia corpus, knowing that an N-gram certainly represents a concept if a corresponding Wikipedia page title exists. The CNN input feature is the vector representation of each word, derived from a word embedding model; the output is the probability that an N-gram represents a concept. Multiple configurations for vertical and horizontal filters were analyzed and configured through a hyper-parameterization process. The results showed an average precision of between 60 and 80 percent, which decreased drastically as N increased. However, combined with a TF-IDF based relevance ranking, the top five N-gram concepts calculated for Wikipedia articles showed a high precision of 94%, similar to part-of-speech (POS) tagging for concept recognition combined with TF-IDF, but with a much better recall for higher N. The CNN seems to prefer longer sequences of N-grams as identified concepts, and can also correctly identify sequences of words normally ignored by other methods. Furthermore, in contrast to POS filtering, the CNN method does not rely on predefined rules, and could thus provide language-independent concept extraction.
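The way the training signal is derived from Wikipedia page titles can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the whitespace tokenization, the lowercasing, and the toy title set are all assumptions made for illustration. In particular, treating every N-gram without a matching title as a negative example is an assumption, since the abstract only states that a matching title certainly indicates a concept.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def ngrams(tokens: List[str], max_n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every consecutive word sequence of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def label_ngrams(text: str, titles: Iterable[str], max_n: int = 3) -> List[Tuple[str, bool]]:
    """Pair each word N-gram of `text` with a label: True if a title matches it.

    Assumption: tokens are split on whitespace and compared case-insensitively;
    N-grams without a matching title are labeled False.
    """
    tokens = text.lower().split()
    normalized: Set[Tuple[str, ...]] = {tuple(t.lower().split()) for t in titles}
    return [(" ".join(g), g in normalized) for g in ngrams(tokens, max_n)]


# Hypothetical toy data standing in for the set of Wikipedia page titles.
titles = {"Neural network", "Convolutional neural network", "Wikipedia"}
examples = label_ngrams("a convolutional neural network learns", titles)
positives = [gram for gram, is_concept in examples if is_concept]
```

In a full pipeline, each labeled N-gram would then be mapped to its word embedding vectors before being passed to the classifier; that step is omitted here.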