
compromises accuracy on the validation set. Feature Selection reduces the number of features but increases accuracy, since it removes only less relevant features. Thus, over-fitting may actually be reduced by Feature Selection.
For single words, Feature Selection is not beneficial: it still reduces accuracy on the training set. However, this could be attributed to the mere reduction in the number of features.
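To make the idea concrete, the following is a minimal, hypothetical Python sketch of a filter-style selection step (it is not the feedback-based procedure described in the paper; the scoring rule and all names are our own assumptions): each word is scored by how differently it occurs in the two feedback classes, and only the highest-scoring, i.e. more relevant, features are kept.

```python
from collections import Counter

def select_features(messages, labels, k):
    """Illustrative filter-style Feature Selection (assumed, simplified):
    score each word by the absolute difference of its document frequency
    in positive vs. negative messages and keep the k best words."""
    # Count in how many positive / negative messages each word appears.
    pos = Counter(w for m, y in zip(messages, labels) if y == 1 for w in set(m.split()))
    neg = Counter(w for m, y in zip(messages, labels) if y == 0 for w in set(m.split()))
    n_pos = sum(1 for y in labels if y == 1) or 1
    n_neg = sum(1 for y in labels if y == 0) or 1
    vocab = set(pos) | set(neg)
    # Words occurring equally often in both classes score 0, i.e. "noise".
    score = {w: abs(pos[w] / n_pos - neg[w] / n_neg) for w in vocab}
    return sorted(vocab, key=lambda w: (-score[w], w))[:k]
```

A word such as "the", appearing in every message regardless of class, scores zero and is dropped first, which mirrors the intuition that removing less explanatory features discards noise rather than signal.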
An important remark concerns computational complexity. While Feature Selection, Feature Representation and the final classification by the SVM are of polynomial complexity (Burges 1998), major differences arise for Feature Extraction. Computational cost is driven mainly by the number of words per text message, the number of features used and the corpus size, i.e. the total number of messages. As the corpus size is a linear complexity factor for all Feature Extraction methods, it is not considered in detail.
Bag-of-words and 2-Grams run in O(M·F), with M as the number of words per message and F as the number of features considered. For the extraction of 2-word combinations, complexity increases to O(M·W·F), with W as the maximum distance between two words. However, the time consumed by the part-of-speech tagging task cannot be bounded by a polynomial (Klein & Manning 2003). Thus, Noun Phrases come at very high cost despite yielding lower validation accuracies than 2-word combinations.
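The complexity figures above can be illustrated with a small Python sketch (function names and the set-based feature lookup are our own assumptions, not part of the original system): bag-of-words scans the M words of a message once, while 2-word combinations additionally examine up to W following words per position, giving the extra factor W.

```python
from itertools import islice

def bag_of_words(message, feature_set):
    """Collect the single-word features occurring in the message:
    one pass over the M words, i.e. O(M) with a hash-set lookup
    (O(M*F) if feature lookup were a linear scan)."""
    return {w for w in message.split() if w in feature_set}

def two_word_combinations(message, feature_set, max_distance):
    """Collect ordered word pairs at most `max_distance` positions apart:
    up to M*W candidate pairs, each checked against the feature set."""
    words = message.split()
    found = set()
    for i, w1 in enumerate(words):
        # Only look at the next `max_distance` words after position i.
        for w2 in islice(words, i + 1, i + 1 + max_distance):
            if (w1, w2) in feature_set:
                found.add((w1, w2))
    return found
```

Under these assumptions the inner window loop is what raises the cost from O(M·F) to O(M·W·F); the part-of-speech tagging needed for Noun Phrases has no comparably simple loop structure and is therefore far more expensive.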
5 CONCLUDING REMARKS 
In summary, our research shows that the combination of advanced Feature Extraction methods and our feedback-based Feature Selection boosts classification accuracy and enables improved sentiment analytics. Feature Selection significantly improves classification accuracies for different feature types (2-Grams, Noun Phrases and 2-word combinations) from 55-58% up to 62-65%. These results are possible because our approach reduces the number of less explanatory features, i.e. noise, and thus may limit the negative effects of over-fitting when applying machine learning approaches to classify text messages.
Our text mining approach was demonstrated in the field of capital markets – an area offering abundant, direct and verifiable exogenous feedback. Such feedback is essential for developing, improving and testing a text mining approach. However, since our approach is generally applicable, it can be used in other domains such as marketing, customer relationship management, security and content handling. Future research will transfer our findings to these areas.
REFERENCES 
Burges, C. 1998. “A Tutorial on Support Vector Machines for Pattern Recognition”, Data Mining and Knowledge Discovery 2, pp. 121-167.
Butler, M., Keselj, V. 2009. “Financial Forecasting using Character N-Gram Analysis and Readability Scores of Annual Reports”, Advances in AI.
Cawley, G., Talbot, N. 2007. “Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters”, Journal of Machine Learning Research 8, pp. 841-861.
Forman, G. 2003. “An extensive empirical study of feature selection metrics for text classification”, Journal of Machine Learning Research 3, pp. 1289-1305.
Groth, S., Muntermann, J. 2011. “An Intraday Risk Management Approach Based on Textual Analysis”, Decision Support Systems 50, p. 680.
Joachims, T. 1998. “Text categorization with support vector machines: Learning with many relevant features”, Proceedings of the European Conference on Machine Learning.
Klein, D., Manning, C. D. 2003. “Accurate Unlexicalized Parsing”, Proceedings of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.
MacKinlay, C. A. 1997. “Event Studies in Economics and Finance”, Journal of Economic Literature, pp. 13-39.
Mittermayr, M.-A. 2004. “Forecasting Intraday Stock Price Trends with Text Mining Techniques”, Proceedings of the 37th Annual Hawaii International Conference on System Sciences.
Muntermann, J., Guettler, A. 2009. “Supporting Investment Management Processes with Machine Learning Techniques”, 9. Internationale Tagung Wirtschaftsinformatik.
Porter, M. F. 1980. “An Algorithm for Suffix Stripping”, Program 14(3), pp. 130-137.
Schumaker, R. P., Chen, H. 2009. “Textual analysis of stock market prediction using breaking financial news: the AZFin Text System”, ACM Transactions on Information Systems 27.
Tetlock, P. C., Saar-Tsechansky, M., Macskassy, S. 2008. “More than Words: Quantifying Language to Measure Firms’ Fundamentals”, The Journal of Finance 63(3), pp. 1437-1467.
KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval