Once sufficiently many labeled examples of the target function are available, the system can switch to a classic supervised classification algorithm such as an SVM to mimic the users’ document classification behavior. The question of which system to use can be answered by comparing the results of the supervised classifier to those of the NTFC in the background. As soon as the supervised classifier outperforms the NTFC, the system can switch to it. Alternatively, the supervised classifier and the NTFC can form a classifier committee. Because the NTFC cannot overfit, such a committee can mitigate overfitting of the supervised classifier. The model can also be used to
extract potential new categories from an existing text
corpus. Documents can be clustered in semantic
space, and cluster means can be computed. These
cluster means can be used to find terms most
descriptive for the cluster. Clusters can then be
regarded as categories while the words closest to the
cluster mean can be used as category labels. The nature of semantic spaces also allows assessing the relationships between the clusters, for example hyponymy and hypernymy relationships between the labels of different categories. We intend to investigate this further in future work, as well as extending the NTFC to work with multiple clusters for text representation, as proposed by Dai et al. (2017). In contrast to Dai et al.’s work, we will try to minimize the amount of external knowledge needed to parameterize the solution.
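The proposed switch-over and committee rules can be sketched as follows. This is a minimal illustration, not part of the system described above: the predictor names, the toy stand-in classifiers, and the accuracy-based criterion are illustrative assumptions.

```python
# Hypothetical sketch: `ntfc_predict` and `svm_predict` stand in for the two
# classifiers; the switch criterion compares their accuracy in the background.

def accuracy(predict, labeled_docs):
    """Fraction of user-labeled documents the predictor classifies correctly."""
    return sum(predict(doc) == label for doc, label in labeled_docs) / len(labeled_docs)

def choose_system(ntfc_predict, svm_predict, labeled_docs):
    """Switch to the supervised classifier once it outperforms the NTFC."""
    if accuracy(svm_predict, labeled_docs) > accuracy(ntfc_predict, labeled_docs):
        return svm_predict
    return ntfc_predict

def committee_predict(predictors, doc):
    """Majority vote over the committee (NTFC plus supervised classifier)."""
    votes = [p(doc) for p in predictors]
    return max(set(votes), key=votes.count)

# Toy stand-ins for illustration only.
ntfc_predict = lambda d: "sports" if "ball" in d else "politics"
svm_predict = lambda d: "sports"
labeled = [("ball game", "sports"), ("election results", "politics")]
chosen = choose_system(ntfc_predict, svm_predict, labeled)
```

As long as the supervised stand-in lags behind, `choose_system` keeps the NTFC; the committee vote can be used in parallel instead of a hard switch.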
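The category-extraction idea can likewise be sketched: cluster document vectors in semantic space and label each cluster with the vocabulary term closest to its mean. The two-dimensional toy embeddings, the vocabulary, and the small k-means helper below are illustrative assumptions; in practice, the word2vec vectors underlying the NTFC would be used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word embeddings (illustrative; in practice: word2vec vectors).
emb = {
    "dog":    np.array([1.0, 0.1]),
    "cat":    np.array([0.9, 0.2]),
    "pet":    np.array([1.0, 0.0]),
    "stock":  np.array([0.1, 1.0]),
    "market": np.array([0.0, 0.9]),
    "trade":  np.array([0.2, 1.0]),
}
vocab = list(emb)

docs = [["dog", "cat"], ["cat", "pet"], ["stock", "market"], ["market", "trade"]]

# Represent each document as the mean of its word vectors.
doc_vecs = np.array([np.mean([emb[w] for w in d], axis=0) for d in docs])

def kmeans(X, k, iters=20):
    """Plain k-means: assign points to nearest centroid, recompute means."""
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

centroids, assign = kmeans(doc_vecs, k=2)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Label each cluster with the vocabulary term closest to its mean.
cluster_labels = [max(vocab, key=lambda w: cos(emb[w], c)) for c in centroids]
```

Each resulting cluster can then be regarded as a category, with the nearest terms serving as its label.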
REFERENCES
Bellman, R. (1961). Adaptive Control Processes: A
Guided Tour, Princeton University Press, USA.
Blei, D. M., Ng, A. Y., Jordan, M. I. (2003). Latent Dirichlet
Allocation. In: Journal of Machine Learning Research,
vol. 3, pp. 993-1022, doi:10.1162/jmlr.2003.3.4-5.993.
Busse J., Humm, B., Lübbert, C., Moelter, F., Reibold, A.,
Rewald, M., Schlüter, V., Seiler, B., Tegtmeier, E.,
Zeh, T. (2015). Actually, What Does “Ontology”
Mean? A Term Coined by Philosophy in the Light of
Different Scientific Disciplines. In: Journal of
Computing and Information Technology – CIT 23, pp.
29-41, doi:10.2498/cit.1002508.
Cho, K., Kim, J. (1997). Automatic text categorization on
hierarchical category structure by using ICF (inverse
category frequency) weighting. In: Proceedings of
KISS conference. pp. 507-510.
Cornell University Library (2016) arXiv.org [online]
Available at: https://arxiv.org [Accessed 15 Dec.
2016]
Dai, X., Bikdash, M., Meyer, M. (2017). From social
media to public health surveillance: Word embedding
based clustering method for Twitter classification. In
Proceedings SoutheastCon, pp. 1-7,
doi:10.1109/SECON.2017.7925400.
DFG (2016) Schwerpunktprogramm “Robust
Argumentation Machines” (SPP 1999), 27 June.
[online] Available at: http://www.dfg.de/
foerderung/info_wissenschaft/2016/info_wissenschaft
_16_38/index.html [Accessed 6 Mar. 2018]
Dumais, S. T. (2005). Latent Semantic Analysis. In:
Annual Review of Information Science and
Technology, vol. 38, pp. 188-230,
doi:10.1002/aris.1440380105.
Egozi, O., Markovitch, S., Gabrilovich, E. (2011).
Concept-Based Information Retrieval using Explicit
Semantic Analysis. In ACM Transactions on
Information Systems, vol. 29, pp. 8:1-8:34. doi:
10.1145/1961209.1961211.
Gabrilovich, E., Markovitch, S., (2006). Overcoming the
brittleness bottleneck using Wikipedia: Enhancing text
categorization with encyclopedic knowledge. In: AAAI
Vol. 6, pp. 1301-1306.
Goldberg, Y., Levy, O. (2014). word2vec Explained:
Deriving Mikolov et al.’s Negative-Sampling
Word-Embedding Method. [online] Available at:
https://arxiv.org/pdf/1402.3722v1.pdf [Accessed 25
Jan. 2017]
Ko, Y., Seo, J. (2009). Text classification from unlabeled
documents with bootstrapping and feature projection
techniques. In Journal of Information Processing and
Management 45, pp. 70-83.
Kusner, M. J., Sun, Y., Kolkin, N., Weinberger, K. Q.
(2015). From Word Embeddings To Document
Distances. In Proceedings of the 32nd International
Conference on Machine Learning, Lille, France.
Lewis, D. (2004) The Reuters-21578 text categorization
benchmark. [online] Available at: http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
[Accessed 02 Aug. 2017]
McCallum, A., Nigam, K. (1999). Text Classification by
Bootstrapping with Keywords, EM and Shrinkage. In:
Workshop On Unsupervised Learning In Natural
Language Processing, pp. 52-58.
Medelyan, O., Frank, E., Witten, I. H. (2009). Human-
competitive tagging using automatic keyphrase
extraction. In Conference on Empirical Methods in
Natural Language Processing EMNLP 09. Singapore.
pp. 1318-1327.
Mikolov, T., Chen, K., Corrado, G., Dean, J. (2013).
Efficient Estimation of Word Representations in Vector
Space. In: Proceedings of Workshop at ICLR. [online]
Available at: http://arxiv.org/pdf/1301.3781.pdf
[Accessed 29 Dec. 2015]
Mikolov, T. (2013) Word2Vec C Code [online] Available
at: https://code.google.com/archive/p/word2vec/
source/default/source [Accessed 05 Dec. 2015]
Mohri, M., Rostamizadeh, A., Talwalkar, A. (2012).
Foundations of Machine Learning, MIT Press,
Cambridge, Massachusetts, USA.
Nadeau, D., Sekine, S. (2007). A survey of named entity
recognition and classification, Lingvisticae
Investigationes, 30(1), pp. 3-26.