Authors: Tobias Eljasik-Swoboda 1 ; Michael Kaufmann 2 and Matthias Hemmje 1

Affiliations: 1 Faculty of Mathematics and Computer Science, University of Hagen, Universitätsstraße 47, 58084 Hagen, Germany ; 2 Data Intelligence Research Team, Lucerne School of Information Technology, Suurstoffi 41, 6343 Rotkreuz, Switzerland

Keyword(s): Text Categorization, Unsupervised Learning, Classification, Text Analytics.

Related Ontology Subjects/Areas/Topics: Business Analytics ; Data Engineering ; Data Management and Quality ; Text Analytics

Abstract: We describe a Text Categorization (TC) classifier that does not require a target function. In TC, documents must be assigned to a set of predefined, labeled categories. Automated TC can be done either by describing fixed classification rules or by applying machine learning. Machine learning-based TC is usually supervised: the learner is trained on example document-to-category assignments (the target function). When TC is introduced for a new application or when new topics emerge, such examples are hard to obtain because they are time-intensive to create and can require domain experts. Unsupervised document classification eliminates the need for such training examples. We describe a method capable of performing unsupervised machine learning-based TC. Our method provides quick, tangible classification results that allow for interactive user feedback and result validation. After uploading a document, the user can agree with or correct the category assignment. This allows our system to incrementally create a target function that a regular supervised learning classifier can use to produce better results than the initial unsupervised system. For this to work, classification must complete in a time acceptable to the user uploading documents. We based our method on word embedding semantics with three different implementation approaches, each evaluated using the reuters21578 benchmark (Lewis, 2004), the MAUI citeulike180 benchmark (Medelyan et al., 2009), and a self-compiled corpus of 925 scientific documents taken from the Cornell University Library arXiv.org digital library (Cornell University Library, 2016). Our method has the following advantages: Compared to keyword extraction techniques, our system can assign documents to categories that are labeled with words that do not literally occur in the document. Compared to usual supervised learning classifiers, no target function is required. Because no target function is required, the system cannot overfit. Compared to document clustering algorithms, our method assigns documents to predefined categories and does not create unlabeled groupings of documents. In our experiments, the system achieves up to 66.73% precision, 41.8% recall, and 41.09% F1 (all on reuters21578) using macroaveraging. Using microaveraging, similar effectiveness is obtained. Even though these results are below those of contemporary supervised classifiers, the system can be adopted in situations where no training data is available or where text needs to be assigned to new categories capturing freshly emerging knowledge. It requires no manually collected resources and works fast enough to gather feedback interactively, thereby creating a target function for a regular classifier.
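
The abstract does not spell out how the word embedding semantics are used, and the three implementation approaches are described only in the full paper. As a rough illustration of the general idea, not the authors' implementation, the following Python sketch assigns a document to the category whose label is closest in an embedding space. It assumes that documents and labels are represented by the mean of their word vectors and compared by cosine similarity; the vectors in `TOY_EMBEDDINGS`, the helper names, and the example categories are all hypothetical and chosen only to make the example runnable.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# A real system would load pretrained embeddings with hundreds of dimensions.
TOY_EMBEDDINGS = {
    "stock": [0.9, 0.1, 0.0],
    "market": [0.8, 0.2, 0.1],
    "finance": [0.85, 0.15, 0.05],
    "wheat": [0.1, 0.9, 0.1],
    "harvest": [0.05, 0.85, 0.2],
    "agriculture": [0.1, 0.88, 0.15],
}


def mean_vector(words, embeddings):
    """Average the vectors of all known words; return None if none are known."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(document, category_labels, embeddings):
    """Return (best_label, scores) using only the category labels, no training data."""
    doc_vec = mean_vector(document.lower().split(), embeddings)
    if doc_vec is None:
        return None, {}
    scores = {}
    for label in category_labels:
        label_vec = mean_vector(label.lower().split(), embeddings)
        scores[label] = cosine(doc_vec, label_vec) if label_vec is not None else 0.0
    best_label = max(scores, key=scores.get)
    return best_label, scores


if __name__ == "__main__":
    labels = ["finance", "agriculture"]
    for text in ["stock market", "wheat harvest"]:
        label, scores = categorize(text, labels, TOY_EMBEDDINGS)
        print(text, "->", label)
```

In this toy setup the document "stock market" is assigned to "finance" even though the word "finance" never occurs in it, which mirrors the advantage over keyword extraction claimed in the abstract. No example assignments are needed, so the user's agreement or correction could later be stored as training data for a supervised classifier.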

CC BY-NC-ND 4.0

Paper citation in several formats:
Eljasik-Swoboda, T.; Kaufmann, M. and Hemmje, M. (2018). No Target Function Classifier - Fast Unsupervised Text Categorization using Semantic Spaces. In Proceedings of the 7th International Conference on Data Science, Technology and Applications - DATA; ISBN 978-989-758-318-6; ISSN 2184-285X, SciTePress, pages 35-46. DOI: 10.5220/0006847000350046

@conference{data18,
author={Tobias Eljasik{-}Swoboda and Michael Kaufmann and Matthias Hemmje},
title={No Target Function Classifier - Fast Unsupervised Text Categorization using Semantic Spaces},
booktitle={Proceedings of the 7th International Conference on Data Science, Technology and Applications - DATA},
year={2018},
pages={35-46},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006847000350046},
isbn={978-989-758-318-6},
issn={2184-285X},
}

TY - CONF

JO - Proceedings of the 7th International Conference on Data Science, Technology and Applications - DATA
TI - No Target Function Classifier - Fast Unsupervised Text Categorization using Semantic Spaces
SN - 978-989-758-318-6
IS - 2184-285X
AU - Eljasik-Swoboda, T.
AU - Kaufmann, M.
AU - Hemmje, M.
PY - 2018
SP - 35
EP - 46
DO - 10.5220/0006847000350046
PB - SciTePress