IMPACT OF FEATURE SELECTION AND FEATURE TYPES ON
FINANCIAL STOCK PRICE PREDICTION
Michael Hagenau, Michael Liebmann and Dirk Neumann
Institute of Information Systems Research, University of Freiburg, Freiburg, Germany
Keywords: Text mining, Machine learning, Financial news, Stock price effect.
Abstract: In this paper, we examine whether stock price effects can be automatically predicted by analyzing unstructured
textual information in financial news. Accordingly, we enhance existing text mining methods to evaluate the
information content of financial news as an instrument for investment decisions. The main contribution of
this paper is the usage of more expressive features to represent text through the employment of market
feedback as part of our word selection process. In a comprehensive benchmarking, we show that a robust
Feature Selection allows lifting classification accuracies significantly above previous approaches when
combined with complex feature types. That is because our approach allows selecting only semantically
relevant features and thus reduces the problem of over-fitting when applying a machine learning approach.
The methodology can be transferred to any other application area providing textual information and
corresponding effect data.
1 INTRODUCTION
News plays an important role for investors when
judging fair stock prices. In fact, news carries
information about the firm’s fundamentals and
expectations of other market participants. From a
theoretical point of view, an efficient valuation of a
firm should be equal to the present value of the
firm’s expected future cash flows. The expectations
crucially depend on the information set that is
available to investors. The set consists of qualitative and quantitative information of different kinds and
from various sources, e.g. corporate disclosures,
news articles and analyst reports. Due to improved
information intermediation, the amount of available
information – especially qualitative information –
increased during the last decades. Since it is getting
increasingly difficult for investors to follow and to
take into account all the information available,
automated classification of the most important
information becomes more relevant.
Research in this area is still in its infancy.
Despite numerous attempts, prediction accuracies
for the stock price effect (i.e. positive or negative)
following the release of corporate financial news
never exceeded 58% (see Table 1) – an accuracy level still close to the random guessing probability for two predictive states (50%), leaving room for substantial improvement.
Automated text mining translates unstructured
information into a machine readable format and
mostly uses machine learning techniques for
classification. While suitable machine learning
techniques for text classification are well established
(Forman 2003; Joachims 1998), the development of
suitable text representations is still part of ongoing
research (Schumaker et al., 2009). In particular,
determining the feature type (e.g. single words or
word combinations) and choosing the most relevant
features to represent text is the crucial part.
Existing literature on financial text mining
mostly relies on very simple textual representations,
such as bag-of-words (i.e. distinct single words).
Further, the list of words or word combinations to
actually represent text is selected based on
dictionaries (Tetlock et al., 2008) or retrieved from
the message corpus based on actual occurrences.
Despite well-researched approaches for selecting the most relevant words or word combinations based on exogenous feedback (Forman, 2003), existing work
relies on frequency-based statistics of the message
corpus, such as TF-IDF (Mittermayr, 2004) or just a
minimum occurrence of a word combination
(Schumaker, 2009). Thus, we expect potential for
improvement in two areas: First, more complex and
expressive features (e.g. Noun Phrases, word
combinations) also capturing semantics should be
used for text representation. Second, these features
should be combined with a robust Feature Selection
procedure to pick those features best discriminating
between news messages leading to positive or
negative stock price effects. As outside feedback
from the stock market is needed to determine if a
message was positive or negative, the Feature
Selection method cannot rely on frequency-based
statistics of the corpus, but has to utilize exogenous
market feedback instead.
As every scholar tailors his methodology to his own data set, results are only vaguely comparable across studies. We therefore rebuild previous approaches in our evaluation to allow for a direct
same-data benchmarking. We employ a data set of
corporate disclosures which only contains firm-value
relevant facts and therefore is very suitable for
developing, improving and testing our approach.
The remainder of the paper is structured as
follows: In section 2, we conduct a review of
relevant research on prediction of stock price effects
based on qualitative information. Section 3 designs
our own approach for analyzing qualitative
information and pinpoints the main innovations
compared to existing work. In section 4, we present
our analyses and findings from the comparison with
existing approaches. Section 5 summarizes and
outlines directions for further research on media
content.
2 RELATED WORK &
RESEARCH QUESTIONS
In this section, we give an overview of existing literature and pinpoint the differences to our approach. Our work is most closely related to Schumaker & Chen (2009), who also report the highest accuracy for stock price prediction based on financial news so far. The authors are among the first
to explore the impact of different Feature Extraction
methods forming the basis for their Support Vector
Machine (SVM) classification. Besides the
extraction of single words and named entities, a
proprietary tool was used to identify and aggregate
noun phrases based on lexical semantic and syntactic
tagging. However, feature selection remained rather
simple: Only those features were selected that
occurred at least three times in a document.
Prediction accuracy did not exceed 58.2%. We
mainly differ from Schumaker & Chen by applying
exogenous-feedback-based Feature Selection to limit
our feature set to the most relevant ones. Additionally, our features are based on 2-word combinations which may occur at word distances greater than zero. Unlike Schumaker & Chen's Noun Phrases and Named Entities, these combinations are not limited to nouns, articles, and other determiners, but may also include verbs. Another closely related
study was performed by Muntermann et al. (2009)
who focus on the same news type (German Adhoc
announcements) to have verifiable stock price
effects. However, the authors’ research can hardly be generalized due to its fairly small sample size of only 423 messages, which additionally had to be divided into training and validation sets. Despite relying on the same data source as our work, their results, at 56.5% accuracy, remain in the range of random guessing probability. Unlike Muntermann et al. (2009), Mittermayr (2004) employs a feature selection to focus on relevant words: the TF-IDF score, which relates the occurrences of a term in the processed document to its occurrences across all documents of the data set.
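To illustrate the kind of frequency-based scoring used there, the following minimal sketch computes TF-IDF weights; scikit-learn and the toy documents are our illustrative choices, not the tooling of the cited study.

```python
# Minimal TF-IDF sketch (illustrative library choice: scikit-learn).
# High-scoring terms are frequent in one document but rare overall.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "profit warning issued for the current fiscal year",
    "record profit and raised guidance for the fiscal year",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)        # documents x terms matrix
terms = vectorizer.get_feature_names_out()    # vocabulary in column order
print(dict(zip(terms, tfidf.toarray()[0].round(2))))
```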
However, prediction accuracy for positive and
negative events is not directly specified in a
comparable manner. Tetlock et al. (2008) use
negative words in Wall Street Journal and Dow
Jones News articles to create a content measure and
predict stock returns. The content measure classifies
messages as positive or negative based on the
Harvard-IV-4 psychosocial dictionary – a selection
of words widely used in psychological studies.
Instead of prediction accuracies, the authors specify
an R² of 0.24% between their content measure and
the observed stock returns. A similar text message base, but a different capital market effect prediction, is used by Groth et al. (2011): the authors predict
intraday market risk based on German Adhoc
announcements and use single words as features.
Like Muntermann et al. (2009), the authors do not
perform any Feature Selection besides the removal
of stopwords. Accuracy values are not comparable due to the different classification task: the absolute accuracy values may seem higher, but they are achieved on subsets of the data.
With improved text mining technology and a
relevant data set, we achieve prediction accuracies
significantly higher than those reported in literature. Existing work on the prediction of stock prices has rarely used a robust Feature Selection to choose the most relevant features. As the number of possible
combinations increases for more complex and
expressive features, it becomes more relevant to
select the features that could discriminate best
between positive and negative effects. In our first
research question, we examine the impact of Feature
Selection for different feature types:
Question 1: Does Feature Selection improve
accuracies for more complex features than single
words?
Prior research has almost exclusively relied on the bag-of-words approach. Consistent with Schumaker &
Chen (2009), we expect better predictive abilities for
more complex features also capturing semantics in
the text. This leads to our second research question:
Question 2: What is the impact of different feature
types on classification accuracy?
The high number of possible combinations for
complex features (such as 2-Grams, noun phrases or
2-word combinations) drives down actual occurrences in the overall message corpus, increasing the risk of over-fitting. Over-fitting describes the phenomenon that machine learning algorithms learn relations and structural dependencies in the training set which do not exist in reality and therefore cannot be transferred to the validation set. Over-fitting occurs when more features are used for learning than there are messages in the training set (i.e. a high number of degrees of freedom; Cawley & Talbot, 2007). This leads to the third research question:
Question 3: Does Feature Selection reduce over-
fitting?
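To make the over-fitting mechanism behind this question concrete, the following toy sketch (our illustration, not an experiment from the paper) trains a linear SVM on purely random features and labels; with far more features than training samples, training accuracy approaches 1.0 while validation accuracy stays near chance.

```python
# Toy demonstration of over-fitting when features outnumber messages.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.random((100, 1000))        # 100 "messages", 1000 random features
y = rng.integers(0, 2, size=100)   # random binary labels

clf = SVC(kernel="linear").fit(X[:70], y[:70])
print(clf.score(X[:70], y[:70]))   # near 1.0: the SVM memorizes noise
print(clf.score(X[70:], y[70:]))   # near 0.5: nothing transfers
```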
The following section describes our approach to
address these research questions.
3 METHODOLOGY
Analyzing unstructured information in the form of text requires a complex processing algorithm. In order to classify text, exogenous feedback is required as the basis for classification. The feedback has to
have a direct cause-and-effect relation to the text.
The algorithm can handle any kind of classification
with two states as long as there is a direct relation
between text and exogenous effect. For simplicity, the two predictable states are referred to as “positive” and “negative” in the following; consequently, the corresponding text messages are called positive and negative messages.
We design a four-step approach in order to process text messages and combine them with their exogenous feedback. The four steps comprise three text processing steps (Feature Extraction, Feature Selection, and Feature Representation) and the final step of the actual machine learning: We use a subset of the data (i.e.
text-effect combinations) to train the machine
learning algorithm. After training, the Support
Vector Machine (SVM) is able to classify the
remaining text messages into positive and negative.
We measure the accuracy by comparing our
classification results to the observed effect. The four
steps of our algorithm can be described as follows:
In Feature Extraction, we first define the feature type (e.g. words or word combinations) that best reflects the content of the message and, second, parse all messages to extract their features. We base our features on all words contained within the body of each message, i.e. we remove tables and graphs. During the parsing, we extract each word separately.
In order to remove redundancy between words that share the same word stem but carry different morphological or inflexional endings, we employ the Porter Stemmer (Porter, 1980). Thus, we extract only word stems.
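A minimal sketch of this tokenize-and-stem step, using NLTK's Porter stemmer; the library choice and the simplified tokenization are ours.

```python
# Tokenize a message body and reduce each word to its Porter stem.
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def extract_word_stems(message_body: str) -> list[str]:
    """Lower-case, tokenize, and stem every word of a message."""
    words = re.findall(r"[a-z]+", message_body.lower())
    return [stemmer.stem(w) for w in words]

print(extract_word_stems("Earnings increased while costs were decreasing"))
# -> ['earn', 'increas', 'while', 'cost', 'were', 'decreas']
```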
For the experiment, the following feature extraction
methods are used:
- Dictionary approach – no features are extracted from the corpus. Instead, single words from the positive and negative word lists in the Harvard-IV-4 psychosocial dictionary are used (see Tetlock et al., 2008).
- Single words retrieved from the corpus – this representation, also called bag-of-words, is the one most often used in literature (e.g. Groth et al., 2011; Mittermayr, 2004; Muntermann et al., 2009).
- N-Grams – a sequence of N words, letters or syllables (as in Butler et al., 2009). Performance of 3-Grams was slightly weaker than that of 2-Grams; thus, 2-Grams were used.
- 2-word combinations – this feature type forms an extension of the word-based 2-Gram, allowing a word distance greater than zero between the two words. In contrast to Noun Phrases, this feature type is not limited to certain parts of speech, but may also contain verbs and adverbs, as long as the Feature Selection attests high explanatory power. This feature type has not been used in literature yet (a sketch of its extraction follows after this list).
- Noun Phrases – a phrase whose head is a noun or a pronoun, optionally accompanied by adjectives or other determiners (as in Schumaker & Chen, 2009). Noun Phrases are extracted using the Stanford Parser (Klein & Manning, 2003).
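The sketch below shows how 2-word combinations can be enumerated from a list of word stems; the maximum word distance W is our assumption, as the paper does not state the value used.

```python
# Extract 2-word combinations: ordered pairs of stems whose index gap
# in the message is at most MAX_DISTANCE (gap 1 reproduces a 2-Gram).
MAX_DISTANCE = 5  # hypothetical value for the window W

def two_word_combinations(stems, max_distance=MAX_DISTANCE):
    """Yield (w1, w2) pairs with 0 < index gap <= max_distance."""
    for i, first in enumerate(stems):
        for j in range(i + 1, min(i + 1 + max_distance, len(stems))):
            yield (first, stems[j])

stems = ["profit", "fell", "sharpli", "in", "q3"]
print(list(two_word_combinations(stems, max_distance=2)))
# -> [('profit', 'fell'), ('profit', 'sharpli'), ('fell', 'sharpli'),
#     ('fell', 'in'), ('sharpli', 'in'), ('sharpli', 'q3'), ('in', 'q3')]
```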
In Feature Selection, we exclude features that are of lower explanatory power. As explanatory
power we define the ability to differentiate between
positive and negative messages. First, we take out
stopwords, such as “and” and “or”. Second, we
calculate the explanatory power using a Chi-Square-based method as in Forman (2003).
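A minimal sketch of this feedback-based selection, keeping the 10% of features that discriminate best between positive and negative messages; scikit-learn is our illustrative library choice.

```python
# Chi-square feature selection against exogenous market feedback.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2

messages = ["profit warning issued", "guidance raised strongly",
            "losses widened again", "record quarterly profit"]
labels = [0, 1, 0, 1]  # 0 = negative, 1 = positive stock price effect

X = CountVectorizer().fit_transform(messages)
selector = SelectPercentile(chi2, percentile=10).fit(X, labels)
X_selected = selector.transform(X)  # only the most discriminative features
```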
In Feature Representation, we design a vector
for each message based on all selected features in
step 2. There are numerous methods of representing
a feature within a vector. We found a feature best
represented when using the logarithm of the
feature’s frequency within one message.
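A one-line sketch of this representation; we use log(1 + tf) so that absent features map to zero, where the +1 offset is our assumption.

```python
# Map raw per-message term counts to log-scaled feature values.
import numpy as np

def represent(term_frequencies: np.ndarray) -> np.ndarray:
    """Log-scale the frequency of each selected feature in one message."""
    return np.log1p(term_frequencies)

counts = np.array([0, 1, 4, 10])   # raw counts of selected features
print(represent(counts).round(2))  # -> [0.   0.69 1.61 2.4 ]
```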
In the Machine Learning step, we train a Support Vector Machine (SVM) on combinations of messages, represented as feature vectors, and their subsequent stock price effects. We transform the stock price effect into a binary measure, i.e. ‘0’ for a negative price effect and ‘1’ for a positive one. We use an SVM since previous findings confirm it to be the best available machine learning method for text classification tasks (Forman, 2003; Joachims, 1998).
Further, in a pilot study, we compared the
performance of Artificial Neural Networks, Naïve
Bayes and SVMs and found SVMs to be best
performing.
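A minimal sketch of this training and classification step on stand-in data; scikit-learn's SVC is our illustrative choice of SVM implementation.

```python
# Train a linear-kernel SVM on feature vectors with binary labels
# (0 = negative, 1 = positive effect) and measure validation accuracy.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 50))          # stand-in log-scaled feature vectors
y = (X[:, 0] > 0.5).astype(int)    # stand-in binary price effect labels

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)
clf = SVC(kernel="linear").fit(X_train, y_train)
print(accuracy_score(y_val, clf.predict(X_val)))  # share classified correctly
```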
Previous work mostly relies on the bag-of-words
scheme, i.e. uses simple single words to represent
text. The main contribution of this paper is the combination of advanced Feature Extraction methods with a customized Feature Selection. The results of the evaluation in the following section show the value-add of Feature Selection for the different feature types.
4 EVALUATION
4.1 Evaluation Approach
In this evaluation, we apply our methodology to a
set of corporate disclosures. We apply the Chi²-
based Feature Selection to different types of features
which have already been described in literature.
By reproducing approaches from literature and applying them to the same data set, we are able to benchmark our approach in a same-data comparison.
Every feature extraction approach is conducted once
with feature selection based on market feedback and
once without, i.e. simply by requiring a minimum
occurrence in the corpus per feature (as e.g. in
Schumaker & Chen, 2009; Butler et al., 2009).
Thereby, we can demonstrate the improvements feasible by selecting features based on market feedback. For exogenous-feedback-based feature
selection, the Chi²-approach is used to choose the
most relevant 10% of features occurring in the
overall message set. If no special feature selection is
performed, only stopwords are removed and all
features with a minimum occurrence of 3 are used
for representation of text messages.
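For comparison, the baseline regime without special feature selection can be sketched as follows; scikit-learn's min_df approximates the occurrence threshold at document level, which is our simplification.

```python
# Baseline regime: stopword removal plus a minimum occurrence per feature.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["profit rose sharply", "profit fell sharply",
          "profit outlook stable", "guidance lowered again"]
# min_df=3 keeps only terms appearing in at least three documents;
# with feedback-based selection, the matrix would additionally be
# filtered down to the top 10% of features by chi-square score.
X_baseline = CountVectorizer(min_df=3, stop_words="english").fit_transform(corpus)
```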
Our data set comprises ~11,000 German Adhoc announcements published between 1998 and 2007. We
removed penny stocks and extreme values (based on
a 99%-interval). We required each message to have
a minimum of 50 words in total. We impose these
filters to limit the influence of outliers and avoid
messages that only contain tables. Finally, we
obtained 9,150 Adhoc announcements with
consistent stock price information eligible for our
experiment.
To capture the announcement effect on financial markets, it is necessary to separate firm-specific effects from market-related effects.
Therefore, we investigate daily abnormal returns on
the day the Adhoc announcement was published
(MacKinlay, 1997).
The stock price effect is used to create a binary measure of its sign and to label all text messages as either positive or negative.
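A minimal sketch of this labeling step; the market-model parameters shown are illustrative placeholders, not estimates from our data.

```python
# Daily abnormal return on the announcement day (MacKinlay, 1997):
# actual stock return minus the market-model expected return.
def abnormal_return(stock_return: float, market_return: float,
                    alpha: float = 0.0, beta: float = 1.0) -> float:
    """Actual return minus (alpha + beta * market return)."""
    return stock_return - (alpha + beta * market_return)

ar = abnormal_return(stock_return=-0.042, market_return=0.003)
label = 1 if ar > 0 else 0  # 1 = positive message, 0 = negative
```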
4.2 Results
Results were obtained by running the SVM with a linear kernel, which has been shown to deliver the best performance for text classification tasks with a high number of features (Joachims, 1998). Table 1 shows the classification results on the full training set (7,100 messages) and the validation set (3,050 messages).
Accuracy is measured as the percentage of correctly classified messages. For all five feature types, we performed training and validation, once with our customized Feature Selection and once without (i.e. using all features with a minimum frequency). Results are stated as classification accuracies. Only for the Dictionary approach (single words I) did we not apply our Feature Selection, as the dictionary itself already constitutes a kind of Feature Selection.
In the following, we present our findings that are
directly related to our research questions.
Finding 1: Chi²-based Feature Selection improved classification accuracies for all feature types
Results show that all feature types benefited from the Chi²-based Feature Selection, as evidenced by improved accuracy in all validation experiments.
The highest performance on the validation set, 65.1%, was achieved for the 2-word combinations with Chi²-based Feature Selection. The 2-word combinations performed slightly better than 2-Grams (62.3%) and Noun Phrases (63.5%) and significantly better than the single word approaches. The 2-word combinations benefited most from Feature Selection, single words least. This observation extends the findings of Forman (2003), who relied on single words as text representation and only found limited benefits of feature selection in combination with an SVM as the machine learning approach.
Table 1: Classification results for different feature types.

| Feature type | Subset | Accuracy without special Feature Selection | Accuracy with Chi²-based Feature Selection |
|---|---|---|---|
| Single words I: based on dictionary | Training | 62.8% | - |
| | Validation | 58.1% | - |
| Single words II: retrieved from corpus | Training | 71.6% | 62.8% |
| | Validation | 58.6% | 58.7% |
| 2-Grams | Training | 78.3% | 69.7% |
| | Validation | 56.8% | 62.3% |
| 2-word combinations | Training | 87.2% | 81.7% |
| | Validation | 58.3% | 65.1% |
| Noun Phrases | Training | 75.2% | 72.1% |
| | Validation | 57.7% | 63.5% |
Finding 2: Classification accuracy increases with
complexity of features when Feature Selection is
used
Classification performance increases with
complexity and expressiveness of features –
expressiveness meaning the ability of features to
capture and express content and explanatory power.
This is consistent with the findings of a previous
study (Schumaker & Chen 2009) showing an
increased performance for Noun Phrases compared
to single words. However, this performance increase
can only be observed when a Feature Selection is
employed. Without exogenous-feedback-based Feature Selection, performance on the validation set is rather similar for all feature types. Features seem to
develop their expressiveness only after selecting the
most relevant features and, thus, taking out the
noise.
The dictionary (single words I) shows slightly lower performance (58.1%) than the single words II retrieved from the corpus (58.6%) due to its limited word set, which cannot capture all the specifics and subject-specific language of the underlying domain. An even lower accuracy was achieved by the 2-Grams without Feature Selection (56.8%), which suffer from
a high number of random combinations with low
expressiveness. Only after selecting those with
highest explanatory power, better accuracies were
reached (62.3%). Without Feature Selection, the 2-
word combinations perform better (58.3%) than 2-
Grams, but slightly worse than the single words. 2-
word combinations may carry more expressiveness
than 2-Grams, but compared to single words, they
also suffer from a high number of random
combinations when used without Feature Selection.
Slightly better performance than for 2-Grams was achieved with Noun Phrases. Due to the high number of possible combinations, we mostly found low frequencies for each Noun Phrase in the corpus (i.e. 95% of features occur fewer than five times). A low number of features representing a text message limits the ability of the SVM to classify correctly.
Further, in contrast to 2-word combinations, Noun
Phrases lack verbs and adverbs limiting their
expressiveness.
Finding 3: Chi²-based Feature Selection appears to reduce over-fitting
When using Feature Selection, we observe lower
accuracy values in the training set. However, we
also observe higher accuracy values on the
validation set for complex feature types. This
indicates that over-fitting in the training set has been
reduced.
The risk of over-fitting increases for more
complex features, such as 2-Grams, noun phrases or
2-word combinations. For these features, the higher
number of possible combinations leads to a higher
number of features (but with low frequency in the
corpus). Thus, Feature Selection is needed to choose
the features with the highest explanatory power and to allow for high validation accuracies.
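A toy illustration of this blow-up (the numbers depend on the example, not on our corpus): even a short message yields far more 2-word combinations than distinct word stems.

```python
# Count distinct single-word features vs. 2-word combinations (W = 5).
stems = ["profit", "fell", "sharpli", "in", "q3", "despit", "strong", "sale"]

single = set(stems)
combos = {(stems[i], stems[j])
          for i in range(len(stems))
          for j in range(i + 1, min(i + 1 + 5, len(stems)))}
print(len(single), len(combos))  # 8 single words vs. 25 combinations
```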
We are cautious in stating a full causal
relationship between Feature Selection and the
reduction in over-fitting. It is obvious that a mere reduction of features (without selecting the most relevant ones) will decrease training accuracy values.
However, just reducing the number of features
compromises accuracy on the validation set. Feature
Selection reduces the number of features, but
increases accuracy, since it only takes out less
relevant features. Thus, over-fitting might actually be reduced by Feature Selection.
For single words, Feature Selection is not beneficial: it still reduces accuracy values in the training set, but this decrease could be attributed to the pure reduction in the number of features.
An important remark relates to computational
complexity. While Feature Selection, Feature
Representation and the final classification by the
SVM are of polynomial complexity (Burges 1998),
major differences arise for Feature Extraction.
Computational cost is mainly driven by the number
of words per text message, number of used features
and the corpus size, i.e. the number of total
messages. As the corpus size is a linear complexity
factor for all Feature Extraction methods, it is not considered in detail.
Bag-of-words and 2-Grams run in O(M*F) with
M as the number of words per message and F as the
number of considered features. For extraction of 2-
word combinations, complexity increases to
O(M*W*F) with W as the maximum distance
between two words. However, the time consumed by the part-of-speech tagging task cannot be bounded by a polynomial (Klein & Manning, 2003). Thus, Noun Phrases come at very high computational cost despite yielding lower validation accuracies than 2-word combinations.
5 CONCLUDING REMARKS
In summary, our research shows that the
combination of advanced Feature Extraction
methods and our feedback-based Feature Selection
boosts classification accuracy and allows improved
sentiment analytics. Feature Selection significantly
improves classification accuracies for different
feature types (2-Grams, Noun Phrases and 2-word combinations) from 55-58% up to 62-65%. These
results were possible because our approach allows
reducing the number of less-explanatory features,
i.e. noise, and thus may limit negative effects of
over-fitting when applying machine learning
approaches to classify text messages.
Our text mining approach was demonstrated in
the field of capital markets – an area with numerous,
direct and verifiable exogenous feedback. Such
feedback is essential to develop, improve and test a
text mining approach. However, since our approach is broadly applicable, it can be used on data sets from other areas such as marketing, customer relationship management, security and content handling. Future
research will transfer our findings to these areas.
REFERENCES
Burges, C. 1998. “A Tutorial on Support Vector Machines
for Pattern Recognition”, Data Mining and Knowledge
Discovery 2, pp. 121-167
Butler, M., Keselj, V. 2009. “Financial Forecasting using
Character N-Gram Analysis and Readability Scores of
Annual Reports”, Advances in AI
Cawley, G., Talbot, N. 2007. “Preventing Over-Fitting
during Model Selection via Bayesian Regularisation of
the Hyper-Parameters”, Journal of Machine Learning
Research 8, pp.841-861
Forman, G., 2003. “An extensive empirical study of feature
selection metrics for text classification”, Journal of
Machine Learning Research 3, pp. 1289-1305
Groth, S., Muntermann, J. 2011. “An Intraday Risk
Management Approach Based on Textual Analysis”,
Decision Support Systems 50, p. 680
Joachims, T., 1998. “Text categorization with support
vector machines: Learning with many relevant
features”, Proceedings of the European Conference on
Machine Learning
Klein, D. & Manning, C. D. 2003. “Accurate
Unlexicalized Parsing”, Proceedings of the 41st
Meeting of the Association for Computational
Linguistics, pp. 423-430.
MacKinlay, C. A. 1997. “Event Studies in Economics and
Finance”, Journal of Economic Literature, pp. 13-39.
Mittermayr, M.-A. 2004. “Forecasting Intraday Stock
Price trends with Text Mining techniques”,
Proceedings of the 37th Annual Hawaii International
Conference on System Sciences
Muntermann, J., Guettler, A., 2009. “Supporting
Investment Management Processes with Machine
Learning Techniques”, 9. Internationale Tagung
Wirtschaftsinformatik
Porter, M. F. 1980. “An Algorithm for Suffix Stripping”,
Program, 14(3): 130–137
Schumaker, R. P., Chen, H. 2009. “Textual analysis of
stock market prediction using breaking financial news:
the AZFin Text System”, ACM Transactions on
Information Systems 27
Tetlock, P. C., Saar-Tsechansky, M. & Macskassy, S.,
2008. “More than words: Quantifying Language to
Measure Firms’ Fundamentals”, The Journal of
Finance, Volume 63, Number 3, June 2008 , pp. 1437-
1467