Contextual Latent Semantic Networks used for Document Classification
Ondrej Hava, Miroslav Skrbek, Pavel Kordik
2012
Abstract
Widely used document classifiers are built on a bag-of-words representation of documents. Latent semantic analysis based on singular value decomposition is often employed to reduce the dimensionality of such a representation. This approach ignores word order in a text, information that can improve the quality of a classifier. We propose a language-independent method that records the context of each word in a context network, utilizing the products of latent semantic analysis. The context networks of individual words are combined into one network that represents a document. A new document is classified according to the similarity between its network and the networks of the training documents. The experiments show that the proposed classifier achieves better performance than common classifiers, especially when the preceding dimensionality reduction is substantial.
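The baseline the abstract describes (a bag-of-words matrix reduced by singular value decomposition) can be illustrated with a minimal sketch. This is not the authors' contextual network method, whose details the abstract does not give. It only shows the standard reduction step and why word order is lost. The toy documents, the helper names `bag_of_words` and `lsa`, and the choice `k=2` are assumptions made for illustration.

```python
import numpy as np

def bag_of_words(docs):
    """Build a term-document count matrix (terms x documents)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            X[index[w], j] += 1.0
    return X, vocab

def lsa(X, k):
    """Truncated SVD: keep only the k largest singular triplets."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical toy corpus: the first two documents differ only in word order.
docs = ["dog bites man", "man bites dog", "stocks fall sharply"]
X, vocab = bag_of_words(docs)

U_k, s_k, Vt_k = lsa(X, k=2)
# One k-dimensional latent vector per document (rows).
doc_vectors = (np.diag(s_k) @ Vt_k).T

# Word order is invisible to bag-of-words: both columns are identical,
# so their latent vectors coincide as well.
same_bow = np.array_equal(X[:, 0], X[:, 1])
```

In this sketch, "dog bites man" and "man bites dog" receive the same count vector and therefore the same latent vector, which is the limitation that motivates recording word context in addition to word counts.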
Paper Citation
in Harvard Style
Hava O., Skrbek M. and Kordik P. (2012). Contextual Latent Semantic Networks used for Document Classification. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012) ISBN 978-989-8565-29-7, pages 425-430. DOI: 10.5220/0004109304250430
in Bibtex Style
@conference{sstm12,
author={Ondrej Hava and Miroslav Skrbek and Pavel Kordik},
title={Contextual Latent Semantic Networks used for Document Classification},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012)},
year={2012},
pages={425-430},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004109304250430},
isbn={978-989-8565-29-7},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2012)
TI - Contextual Latent Semantic Networks used for Document Classification
SN - 978-989-8565-29-7
AU - Hava O.
AU - Skrbek M.
AU - Kordik P.
PY - 2012
SP - 425
EP - 430
DO - 10.5220/0004109304250430