The graph shows that all tested variants of the proposed classifier outperformed the C5.0 tree by 10%-30%. The other standard classifiers tested were outperformed as well, especially when a small number of topics was used. When the reduction of dimensionality is less significant, information about the word order in documents does not improve the classification much.
8 CONCLUSIONS
We proposed a network representation of text documents that captures information about sequences of tokens and makes it possible to exploit the features produced by latent semantic analysis. We then illustrated how the network representation helps to improve classification accuracy.
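As an illustrative sketch only (the exact construction is defined earlier in the paper), a network representation over token sequences can be approximated by a weighted co-occurrence graph built with a sliding window; the function name and windowing scheme below are assumptions:

```python
from collections import Counter

def context_network(tokens, window=2):
    """Build a weighted co-occurrence graph: an edge (u, v) counts how
    often tokens u and v appear within `window` positions of each other.
    Illustrative sketch only, not the paper's exact construction."""
    edges = Counter()
    for i, tok in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            edges[tuple(sorted((tok, tokens[j])))] += 1
    return edges

doc = "latent semantic analysis of text uses latent topics".split()
net = context_network(doc, window=2)
```

Edge weights of such a graph can then serve as sequence-aware features alongside the LSA topic vectors.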
When information about context was present in the input features, the classifiers performed considerably better, especially when the dimensionality reduction was significant. We achieved a 10-30% improvement in comparison with the standard representation combined with the kNN or C5.0 algorithms.
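For context, the kNN baseline mentioned above can be sketched as cosine similarity over term-count vectors with majority voting; this is a generic sketch under assumed names, not the paper's exact experimental setup:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, train, k=3):
    """Return the majority label among the k most similar training docs."""
    scored = sorted(train, key=lambda dl: cosine(query, dl[0]), reverse=True)
    labels = [label for _, label in scored[:k]]
    return Counter(labels).most_common(1)[0][0]

train = [
    (Counter("sport game ball".split()), "sport"),
    (Counter("game team win".split()), "sport"),
    (Counter("stock market price".split()), "finance"),
]
```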
The size of the context window does not influence the classification accuracy as strongly. We observed that a larger context yields a slightly better classifier: the largest context of ten tokens outperformed the shortest context of two tokens by 2% on average.
Possible modifications of the proposed method include:
- tokenization of documents into n-grams instead of words before SVD and context networks are applied,
- application of different methods of constructing the context topic vector u.
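The first modification above, n-gram tokenization before SVD, could be sketched as follows; the choice of character-level n-grams (rather than word n-grams) is an assumption made here for illustration:

```python
def char_ngrams(text, n=3):
    """Split a document into overlapping character n-grams, an assumed
    alternative to word tokenization before SVD is applied."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

grams = char_ngrams("topic", n=3)  # overlapping 3-character slices
```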
Our future work will focus on improving the algorithm to speed up the construction and comparison of larger context networks.
KDIR 2012 - International Conference on Knowledge Discovery and Information Retrieval