compromises accuracy on the validation set. Feature
Selection reduces the number of features but
increases accuracy, since it removes only the less
relevant features. Thus, Feature Selection may
actually reduce over-fitting.
For single words, Feature Selection is not
beneficial: it still reduces accuracy on the training
set. However, this could be attributed to the mere
reduction in the number of features.
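The feedback-based selection criterion itself is described earlier in the paper and is not repeated here. As a minimal, hypothetical sketch of the general score-and-prune idea, the following pipeline uses chi-square scoring as an assumed stand-in for the actual feedback-based scores and shows how pruning less relevant features fits between extraction and SVM classification; corpus, labels and the parameter k are illustrative.

```python
# Illustrative sketch only: generic score-and-prune feature selection for
# text classification. The chi-square criterion is an assumed stand-in for
# the paper's feedback-based selection; data and k are toy values.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus; in the paper's setting these would be finance-related text
# messages with exogenously derived class labels.
messages = [
    "shares surge after strong earnings report",
    "profit warning sends the stock lower",
    "dividend increase pleases investors",
    "regulator opens probe into accounting",
    "new product launch lifts revenue outlook",
    "lawsuit risk weighs on the share price",
]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive reaction, 0 = negative

pipeline = Pipeline([
    ("bow", CountVectorizer()),            # bag-of-words feature extraction
    ("select", SelectKBest(chi2, k=10)),   # keep only the highest-scoring features
    ("svm", LinearSVC()),                  # final SVM classification
])
pipeline.fit(messages, labels)
```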
An important remark relates to computational
complexity. While Feature Selection, Feature
Representation and the final classification by the
SVM are of polynomial complexity (Burges 1998),
major differences arise for Feature Extraction.
Computational cost is mainly driven by the number
of words per text message, the number of features
used and the corpus size, i.e. the total number of
messages. As the corpus size is a linear complexity
factor for all Feature Extraction methods, it is not
considered in detail.
Bag-of-words and 2-Grams run in O(M*F), with
M as the number of words per message and F as the
number of features considered. For the extraction of
2-word combinations, complexity increases to
O(M*W*F), with W as the maximum distance
between two words. However, the time consumed by
the part-of-speech tagging task cannot be bounded by
a polynomial (Klein & Manning 2003). Thus, Noun
Phrases come at a very high cost despite yielding lower
validation accuracies than 2-word combinations.
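To make these complexity drivers concrete, the following minimal sketch shows where M, W and F enter the two extraction loops; the example tokens, feature sets and window size are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of the two extraction loops whose costs are discussed above.
# Feature sets and the window size W are illustrative assumptions.

def bag_of_words_features(tokens, vocabulary):
    """O(M) with a hash lookup per token; O(M*F) with a naive membership scan."""
    counts = {}
    for token in tokens:                      # M iterations
        if token in vocabulary:               # lookup against F known features
            counts[token] = counts.get(token, 0) + 1
    return counts

def two_word_combination_features(tokens, vocabulary_pairs, window):
    """Up to M*W candidate pairs, each checked against the F known combinations."""
    counts = {}
    for i, left in enumerate(tokens):                     # M iterations
        for right in tokens[i + 1 : i + 1 + window]:      # up to W iterations
            pair = (left, right)
            if pair in vocabulary_pairs:                  # lookup among F pairs
                counts[pair] = counts.get(pair, 0) + 1
    return counts

tokens = "the company raised its profit forecast for the year".split()
print(bag_of_words_features(tokens, {"profit", "forecast", "loss"}))
print(two_word_combination_features(tokens, {("raised", "forecast")}, window=4))
```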
5 CONCLUDING REMARKS
In summary, our research shows that the
combination of advanced Feature Extraction
methods and our feedback-based Feature Selection
boosts classification accuracy and allows improved
sentiment analytics. Feature Selection significantly
improves classification accuracies for different
feature types (2-Grams, Noun Phrases and 2-word
combinations) from 55-58% up to 62-65%. These
results were possible because our approach reduces
the number of less explanatory features, i.e. noise,
and thus may limit the negative effects of
over-fitting when machine learning approaches are
applied to classify text messages.
Our text mining approach was demonstrated in
the field of capital markets, an area that offers
abundant, direct and verifiable exogenous feedback.
Such feedback is essential to develop, improve and
test a text mining approach. However, since our
approach is not tied to this domain, it can be applied
to other fields such as marketing, customer
relationship management, security and content
handling. Future research will transfer our findings
to these areas.
REFERENCES
Burges, C. 1998. “A Tutorial on Support Vector Machines
for Pattern Recognition”, Data Mining and Knowledge
Discovery 2, pp. 121-167.
Butler, M., Keselj, V. 2009. “Financial Forecasting using
Character N-Gram Analysis and Readability Scores of
Annual Reports”, Advances in AI.
Cawley, G., Talbot, N. 2007. “Preventing Over-Fitting
during Model Selection via Bayesian Regularisation of
the Hyper-Parameters”, Journal of Machine Learning
Research 8, pp. 841-861.
Forman, G. 2003. “An extensive empirical study of feature
selection metrics for text classification”, Journal of
Machine Learning Research 3, pp. 1289-1305.
Groth, S., Muntermann, J. 2011. “An Intraday Risk
Management Approach Based on Textual Analysis”,
Decision Support Systems 50, p. 680.
Joachims, T. 1998. “Text categorization with support
vector machines: Learning with many relevant
features”, Proceedings of the European Conference on
Machine Learning.
Klein, D., Manning, C. D. 2003. “Accurate
Unlexicalized Parsing”, Proceedings of the 41st
Meeting of the Association for Computational
Linguistics, pp. 423-430.
MacKinlay, A. C. 1997. “Event Studies in Economics and
Finance”, Journal of Economic Literature, pp. 13-39.
Mittermayer, M.-A. 2004. “Forecasting Intraday Stock
Price Trends with Text Mining Techniques”,
Proceedings of the 37th Annual Hawaii International
Conference on System Sciences.
Muntermann, J., Guettler, A. 2009. “Supporting
Investment Management Processes with Machine
Learning Techniques”, 9. Internationale Tagung
Wirtschaftsinformatik.
Porter, M. F. 1980. “An Algorithm for Suffix Stripping”,
Program 14(3), pp. 130-137.
Schumaker, R. P., Chen, H. 2009. “Textual analysis of
stock market prediction using breaking financial news:
the AZFin Text System”, ACM Transactions on
Information Systems 27.
Tetlock, P. C., Saar-Tsechansky, M., Macskassy, S. 2008.
“More than Words: Quantifying Language to
Measure Firms’ Fundamentals”, The Journal of
Finance 63(3), pp. 1437-1467.