news on long-term market trends using different techniques of text classification. We continued the research presented in (Kroha and Baeza-Yates, 2005), (Kroha et al., 2006), (Kroha and Reichel, 2007), and (Kroha et al., 2007), which led to the conclusion that the quotient between good and bad news starts to recover several months before the markets start to grow again.
In (Kroha and Reichel, 2007), we introduced our grammar-driven text classification method for market news in English and found that the curve representing the quotient between positive and negative news starts to grow some months before the markets start to grow. Because we investigate long-term trends and because our data start in 1999, we found only one occurrence of the situation described above.
Since then, the subprime crisis has occurred, and there is another downtrend (2007 - 2009) and another turning point (March 2009) in the markets.
In the investigation described in this paper, we
want to answer the following questions:
• How easy or difficult is it to construct a grammar for our grammar-driven classification of textual market news when it has to be done for the German language (complex declension, conjugation, irregular plurals)? We tried to answer this question for German because we have about 554,000 news items available.
• When we change not only the language of the news but also the time interval, do we get results similar to our previous ones? That is, we were curious whether the dependences found for English market news from the time interval 1999 - 2006 can also be observed in German market news from the time interval 1999 - 2009.
We used the grammar-driven method as in (Kroha and Reichel, 2007), but we constructed a grammar for the classification of news written in German. Finally, we found that the conclusion drawn from analysing market news in English from 1999 - 2006 can be confirmed through experiments with market news in German from 1999 - 2009. As we will show, the news indicator (good news/bad news) produced by our grammar-driven text classification method seems to have interesting features that could be used for forecasting, because the news indicator changes its trend some months before the market changes its trend.
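The following sketch illustrates how such a monthly indicator could be computed. It is only a minimal illustration: the function classify_news stands in for the grammar-driven classifier, and the monthly grouping and the handling of neutral news are our assumptions, not the exact implementation used in the experiments.

```python
from collections import defaultdict

def news_indicator(news_items, classify_news):
    """news_items: iterable of (month, text) pairs;
    classify_news: text -> 'good', 'bad' or 'neutral' (hypothetical classifier)."""
    counts = defaultdict(lambda: {"good": 0, "bad": 0})
    for month, text in news_items:
        label = classify_news(text)
        if label in ("good", "bad"):
            counts[month][label] += 1
    # Quotient of good to bad news per month; months without bad news are skipped.
    return {m: c["good"] / c["bad"]
            for m, c in sorted(counts.items())
            if c["bad"] > 0}
```

The resulting monthly quotient series can then be compared with the market index to look for trend changes that precede those of the market.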
The rest of the paper is organized as follows. In
Section 2 we discuss related work. In Section 3 we
briefly explain why we want to use a grammar-driven
method. In Section 4 we take a look at the implemented system that processes all messages. Section 5
describes our results in comparison to the DAX stock
index. In the last section we present some measure-
ment results and conclude.
2 RELATED WORK
In related papers, the approach to the classification of market news is similar to the approach to document relevance. Experts construct a set of keywords that they consider important for moving markets. The occurrences of this fixed set of several hundred keywords are counted in every message. The counts are then transformed into weights. Finally, the weights are the input to a prediction engine, which forecasts the class to which the analyzed message should be assigned.
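A minimal sketch of this generic pipeline, assuming a hypothetical fixed keyword list and a tf-idf-style weighting (the concrete weighting scheme and prediction engine differ from paper to paper), might look as follows:

```python
import math
from collections import Counter

def keyword_weights(message, keywords, doc_freq, n_docs):
    """Count keyword occurrences in one message and turn the counts
    into tf-idf-style weights (one of several possible weighting schemes)."""
    tokens = message.lower().split()
    counts = Counter(t for t in tokens if t in keywords)
    return {k: counts[k] * math.log(n_docs / (1 + doc_freq.get(k, 0)))
            for k in keywords}
```

The resulting weight vectors would then be fed into a prediction engine, e.g. any standard classifier trained on labeled messages.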
In (Nahm and Mooney, 2001) and (Nahm, 2002), a small number of documents is manually annotated (we can say indexed), and the obtained index, i.e. a set of keywords, is then applied to a large body of text to construct a large structured database for data mining. The authors work with documents containing job posting templates. A similar procedure can be found in (Macskassy and Provost, 2001). The key to this approach is that the user labels historical documents. These data then form a training corpus to which inductive algorithms are applied to build a text classifier.
In (Lavrenko et al., 2000), a set of news is corre-
lated with each trend. The goal is to learn a language
model correlated with the trend and use it for predic-
tion. A language model determines the statistics of
word usage patterns among the news in the training
set. Once a language model has been learned for ev-
ery trend, a stream of incoming news can be mon-
itored and it can be estimated which of the known
trend models is most likely to have generated the story.
Compared to our investigation, there are two differ-
ences. One difference is that Lavrenko uses his mod-
els of trends and corresponding news only for day
trading. The weak point of this approach is that it is
not clear how quickly the market responds to news re-
leases. The other difference is that our grammar-driven method respects the structure of a sentence, which can have a fundamental influence on the meaning of the sentence.
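To make the language-model idea concrete, the following is a highly simplified sketch of the principle, using a unigram model per trend with add-one smoothing; the models in (Lavrenko et al., 2000) are considerably more elaborate, so this should be read only as an illustration of the general mechanism, not as their implementation.

```python
import math
from collections import Counter

def train_trend_model(docs):
    """Estimate unigram word statistics from the news correlated with one trend."""
    counts = Counter(w for d in docs for w in d.lower().split())
    return counts, sum(counts.values()), len(counts)

def log_likelihood(story, model):
    """Log-probability of a story under a trend model (add-one smoothing)."""
    counts, total, vocab = model
    return sum(math.log((counts[w] + 1) / (total + vocab + 1))
               for w in story.lower().split())

def most_likely_trend(story, models):
    """Pick the trend whose language model is most likely to have generated the story."""
    return max(models, key=lambda t: log_likelihood(story, models[t]))
```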
In our previous work (Kroha and Baeza-Yates, 2005), we experimented with statistical methods of text classification based on the frequency of terms to distinguish between positive news and negative news with respect to long-term market trends. In (Kroha and Reichel, 2007), we presented a grammar-driven text mining method, i.e. we built a grammar that describes templates typical