A CASE STUDY

Classiﬁcation of Stock Exchange News by Support Vector Machines

P. Kroha, K. Kr

ober and R. Janetzko

Department of Computer Science, University of Technology, Strasse der Nationen 62, 09111 Chemnitz, Germany

Keywords:

Text mining, Text classiﬁcation, Support vector machine, Market news classiﬁcation, Market news classiﬁca-

tion in parallel.

Abstract:

In this paper, we present a case study concerning the classiﬁcation of text messages with the use of Support

Vector Machines. We collected about 700.000 news and stated the hypothesis saying that when markets are

going down then negative messages have a majority and when markets are going up then positive messages

have a majority. This hypothesis is based on the assumption of news-driven behavior of investors. To check

the hypothesis given above we needed to classify the market news. We describe the application of Support

Vector Machines for this purpose including our experiments that showed interesting results. We found that the

news classiﬁcation has some interesting correlation with long-term market trends.

1 INTRODUCTION

Currently, more and more commercially valuable

business news become available on the World Wide

Web in electronic form. However, the volume of busi-

ness news is very large. The ﬁrst problem is to ﬁnd

how much the news really move stock markets. The

next problem is that people do not have time enough

to read all the news. To solve those two problems

the automatic processing of news is necessary. One

of possibilities how to process news automatically is

classiﬁcation.

In our project, we classiﬁed the market news into

two classes; positive news and negative news. But

the news classiﬁcation was not the only goal. Fur-

ther, we investigated how the long-term trends corre-

spond to the news and whether the knowledge gained

from news can be used in an attempt to predict long-

term trends of ﬁnancial markets. Papers already pub-

lished only investigate short-time inﬂuence of small

sets of messages suitable for day-trading. Our novel

approach is using Support Vector Machines (SVM)

for the classiﬁcation of large sets of messages for

long-term market prediction. We used linear SVMs

as well as nonlinear SVMs, both with sequential and

parallel algorithms. We describe our experiments, ex-

perience, and results.

The rest of the paper is organized as follows. In

Section 2, we discuss the related work. In Section

3, we brieﬂy explain the concept of Support Vector

Machines. In Section 4, the implementations we used

are described. Section 5 describes our experiments

in classiﬁcation. In Section 6, we show the relation

between the number of positive resp. negative news

and market trends. We conclude in the last section.

2 RELATED WORK

The Support Vector Machine is a method often used

for text classiﬁcation as has been shown e.g. in

(Joachims, 1998b), (Joachims, 1998a), (Joachims,

2001).

In our previous works (Kroha et al., 2006), (Kroha

et al., 2007) and (Kroha and Reichel, 2007), we suc-

cesfully used other methods for text classiﬁcation but

there was no possibility to run the classiﬁcation algo-

rithms in parallel. In (Kroha et al., 2006), we surpris-

ingly achieved the best results using the Naive Bayes

classiﬁer. But we found that there are important clas-

siﬁcation problems that cannot be decided when using

the term frequency as the only object feature. We de-

veloped the next method (Kroha et al., 2007) based

on grammars that describe the characteristic features

of document classes in another way. In (Kroha and

Reichel, 2007), we compared the results of both meth-

ods.

Using Support Vector Machines (SVM), our goal

was to investigate the parallel approach to the text

331

Kroha P., Kröber K. and Janetzko R. (2010).

A CASE STUDY - Classiﬁcation of Stock Exchange News by Support Vector Machines.

In Proceedings of the 5th International Conference on Software and Data Technologies, pages 331-336

DOI: 10.5220/0003043403310336

 SciTePress

classiﬁcation problem and to compare the perfor-

mance and results of various classiﬁcation methods.

Our experiments in (Janetzko, 2008) and (Kroeber,

2010) have shown that our previously used methods

resulted more or less in the same classiﬁcation. Ad-

ditionally, we obtained information about the rate of

correctly classiﬁed documents.

3 INTRODUCTION INTO THE

CONCEPT OF THE SUPPORT

VECTOR MACHINE

Support Vector Machines (SVM) are supervised

methods of machine learning used for classiﬁcation,

especially for two-class pattern recognition. The con-

cept is based on the linear separation of two sets of

data by a hyperplane in a n-dimensional feature space.

It needs a set of training objects as an input. Objects

to be classiﬁed have to be described as vectors. The

original optimal hyperplane algorithm proposed was

a linear classiﬁer. The case of interest is when the

separation can be made only by a nonlinear function.

The nonlinear classiﬁcation problem in a n-

dimensional feature space can be solved (based on

(Braverman et al., 1964)) in such a way that a set of

input vectors will be nonlinearly transformed into a

feature space that has higher dimension than n where

the classiﬁcation problem becomes linear.

In (Boser et al., 1992) and (Vapnik, 1995), a way

has been suggested how to create nonlinear classiﬁers

by applying the kernel trick to maximum-margin hy-

perplanes. The resulting algorithm is formally sim-

ilar, except that the dot product of the mapping to

the higher dimensional space is replaced by a non-

linear kernel function. Using Support Vector Ma-

chines for text classiﬁcation was deeply investigated

in (Joachims, 2001).

4 IMPLEMENTATIONS USED

The parameters of the separating maximum-margin

hyperplane of the SVM are derived by solving the

appropriate optimization. There exist several spe-

cialized algorithms. For our experiments described

below, we used implementations SVMlight (Vapnik,

1995), GPDT (Seraﬁni et al., 2005) (working se-

quentially) and PGPDT (Zanni and Zanghirati, 2003)

(working in parallel):

• SVMlight implements the SVM presented by

Vapnik in (Vapnik, 1995). Its optimization algo-

rithm was described in (Joachims, 2001). It has

the ability to estimate the error rate, precision, and

recall. SVMlight has already been used for text

classiﬁcation. We used the version 6.02 in our ex-

periments. In the ﬁrst phase, the learning phase,

the model will be built from the training data. For

this process various parameters can be stated, e.g.

kernel function. The model is then used in the

next phase, in the classiﬁcation phase.

• GPDT (Seraﬁni et al., 2005) is able to train Sup-

port Vector Machines using large quantities of

data, some hundred thousand vectors. In compar-

ison to SVMlight, the GPDT supports only train-

ing and buidling of a classiﬁcation model. Other

computations, e.g. regression, precision, recall,

are not included.

• PGPDT (Zanni and Zanghirati, 2003), (Zanni

et al., 2006) is a version of GPDT designed for

the use on MPI-based parallel systems with dis-

tributed memory. One of our goals was to ﬁnd

how much time we can gain when using a parallel

approach.

The nonlinear tests both used GPDT/PGPDT be-

cause we found them to perform best on our prob-

lems.

5 Our Experiments

5.1 Data Preparation

We used the same stock exchange news in Ger-

man language as in our previous work (Kroha et al.,

2006). However, the time interval was extended from

1.11.1999 until 13.7.2007.

It was necessary to convert our data into sparse

vector format used by the implementations described

above. We separated the news, we built the feature

vectors and the classes, and produced the training and

test data. We used a stopwords list of about 650 items.

Lastly, we speciﬁed the 10.000 most frequently used

words in our news and ignored all the other words.

For producing feature vectors we used the Mal-

let library of Java classes. This process resulted in

about 720.000 feature vectors of length 10.000, con-

taining the normalized frequencies of words in the

news. There exist two distinct market trends - up and

down. Therefore we chose two classes representing

these trends and named them appropriately UP and

DOWN. We used the class UNKNOWN for not yet

classiﬁed news. This distinguishes news classiﬁed in

the trainig from news to be classiﬁed in the test.

To produce training data we used 75% of news of

each week (totally 391.426 vectors). The other 25%

ICSOFT 2010 - 5th International Conference on Software and Data Technologies

332

of news (totally 326.123 vectors) have been used for

testing. The data preparation took altogether about 3

hours of processing time.

5.2 Building a Classiﬁcation Mode and

its Performancel

After the data had been prepared we started to build

the classiﬁcation model of news using three different

implementations of SVMs.

It was necessary to experiment with different pa-

rameters. Our starting point is displayed in Tables 1,

2, 3. We adjusted those parameters until we obtained

the parameters given in Tables 4, 5, 6.

Table 1: Parameter tuning - 1.

SVMlight GPDT

kernel linear linear

q 10 4.000

n 10 1.332

c 31,2436 10

tolerance 0,001 0,001

iterations 183.428 277

solver 142.246

kernel eval. 14.707.934 1.228.220.040

CPU time 1.297,06 2.878,78

prepar. 755,80

comput. 2047,73

update 27,47

actual time 3h 16min 49min

vectors 235.083 246.455

Table 2: Parameter tuning - 2.

PGPDT

kernel linear

q 4.000

n 1.332

c 10

tolerance 0,001

iterations 278

solver iterations 132.489

kernel evaluations 2.617.571.750

CPU-run time (secs) 2.278,02

therefrom preparation 856,38

therefrom computation 1.189,63

therefrom updating 147,95

actual run time 41min

vectors 246.455

The ﬁrst column shows the run time using the

standard parameters of SVMlight. We experimented

with GPDT/PGPDT parameters necessary for larger

subproblems. We tried using 4000 algorithm intern

Table 3: Parameter tuning - 3.

GPDT PGPDT

kernel nonlinear nonlinear

q 400 400

n 132 132

c 10 10

tolerance 0,001 0,001

iterations 5.093 5101

solver 172.407 890982

kernel eval. 719.387.854.363 23.896.255.511

CPU time 172.406,560 6845,6

prepar. 65,79 44,89

comp. 462.190 105,32

update 171.473.030 6304,92

run time 47h 54min 1h 55min

vectors 239.213 239.199

Table 4: Parameters ﬁnally obtained - 1.

SVMlight GPDT

kernel linear linear

q 10 2.000

n 10 666

c 10 10

tolerance 0,001 0,001

iterations 78.100 585

solver 176.986

kernel eval. 8.948.056 649.692.729

CPU time 1.418,29 1.351,97

prepar. 260,93

comput. 931,59

update 58,88

actual time 2h 49min 24min

vectors 246.137 246.440

variables for the subproblems, i.e. twice as much

as normal. The number of iterations was cut in half

but the number of kernel-evaluations doubled. To-

tally, GPDT needed 49 minutes, and PGPDT needed

41 minutes. In this case, the shorter computing time

dominated over growing overhead. The PGPDT im-

plementation was the fastest.

The ﬁnal parameters and run times achieved can

be seen in Table 2. The parameter q describes the

number of algorithm intern variables in a subproblem,

n is the number of variables replaced after each itera-

tion and c is the value with which the false classiﬁed

vectors will be penalized. The parameter c was com-

puted automatically by

from training data.

The listed values in Tables 4, 5, 6 represent the

best parameter combinations we found. For subprob-

lems with 2.000 variables 585 iterations are sufﬁcient.

SVMlight took 78.100 iterations for only 10 vari-

ables. Each of these two experiments needed between

A CASE STUDY - Classification of Stock Exchange News by Support Vector Machines

333

Table 5: Parameters ﬁnally obtained - 2.

PGPDT

kernel linear

q 2.000

n 666

c 10

tolerance 0,001

iterations 601

solver iterations 188.373

kernel evaluations 1.412.483.880

CPU-run time (secs) 363,28

therefrom preparation 43,33

therefrom computation 217,75

therefrom updating 57,12

actual run time 6min

vectors 246442

Table 6: Parameters ﬁnally obtained - 3.

GPDT PGPDT

kernel nonlinear nonlinear

q 400 400

n 132 132

c 10 10

tolerance 0,001 0,001

iterations 5.093 5.101

solver 172.407 890982

kernel eval. 719.387.854.363 23.896.255.511

CPU time 172.406,560 6845,6

prepar. 65,790 44,89

comput. 462.190 105,32

update 171.473.030 6.304,92

actual time 47h 54min 1h 55min

vectors 239.213 239.199

22 and 25 minutes of CPU time. The rest of the time

was used for the input of training data and the output

of model data. The nonlinear sequential classiﬁcation

using GPDT took 47 hours and 54 minutes and there-

fore much more time than the linear classiﬁcations.

Changing q, n and c had signiﬁcant impact on the lin-

ear runs. Nonlinear runtime, especially the sequential

nonlinear run didn’t improve at all by tinkering with

the parameters.

As we have found, the parallel implementation

PGPDT was used in published applications that are

relatively small compared to our data. For example

in (Zanni et al., 2006), it was tested using 10000 sam-

ples subset of the MNIST handwritten digits database,

constituted by 5000 samples of the digit ”8” and 5000

samples of the other digits. Compared with that our

data containing 717.549 documents is very large.

The PGPDT implementation made a huge impact

on runtime, especially combined with a nonlinear ker-

Figure 1: Comparison between linear and nonlinear classi-

ﬁcation.

nel. PGPDT ran on a 8 CPU-core Apple MacPro us-

ing OpenMPI and 8 processes. The runtime of the

linear kernel run was cut down from 24 minutes to 6

minutes. The extremely long 47 hour run time of the

nonlinear run was cut down to a comparatively small

2 hours. The number of kernel evaluations sunk from

719 billion to 23 billion. The update time sank by

171.466.726 seconds to 6.305 seconds. Parallel pro-

cessing boosted the runtime to acceptable levels.

6 THE CLASSIFICATION

OBTAINED

After the model was built, the next step was to clas-

sify the as of yet unclassiﬁed news, i.e. the news

that hadn’t been used for training. This was done

using the svn classify program, part of Joachim’s

svm light package. The classiﬁcation of the linear

model’s 326.123 vectors took about 3 minutes. The

nonlinear model classiﬁcation took about 5 hours.

The linear classiﬁer classiﬁed 80% of the news cor-

rectly. The nonlinear classiﬁer brought almost iden-

tical results and classiﬁed 81% correctly. Classi-

ﬁed news have their time stamp, i.e. they belong to

a given week, and therefore the quotient Positive −

news/Negative − news can be presented as a time se-

ries and placed into the same graph as the market

trend of German stock exchange index DAX. The

forecast done in this way took about 2 minutes. The

result can be seen in Fig. 1.

As described above, we investigated the hypothe-

sis about a dependency between a quotient of positive

and negative news and market trend. The possibility

stated above is that the positive news have majority

ICSOFT 2010 - 5th International Conference on Software and Data Technologies

334

during upward trends. The question is whether a ma-

jority of positive news will be reached before the mar-

ket trend starts to go up. In such a case there would

be perhaps a small forecast possibility.

The Fig. 1 shows both linear and nonlinear prog-

nosis as well as the DAX values. The linear and non-

linear classiﬁcations are almost identical. Theoreti-

cally, a nonlinear SVM should classify better (Zanni

and Zanghirati, 2003). However, this was not the

case since our dataset was biased towards the DOWN

class. Therefore, the classiﬁer classiﬁes most of the

DOWN news correctly, since they are more promi-

nent.

During the time when the trend stays stable, i.e.

DOWN-phase 2000, about 90% - 95% of news are

correctly classiﬁed. When the trend becomes unsta-

ble, the precision of classiﬁcation is going down to

40%, i.e. in UP-phase 2001, DOWN-phase 2004,

DOWN-phase 2006.

We can see that the news trend changed from

DOWN to UP with July 2002 but the market trend

changed from DOWN to UP with a delay in the 11.

week of 2003. In other break points, this phenomenon

cannot be seen so clearly but we can see the change of

the news trend from UP to DOWN at the end of 2006,

i.e. some months before the subprime crisis started to

break the long-term market trend.

This can be explained by investor psychology.

During the enthusiastic UP market trend, investors do

not want to accept the coming negative news and be-

cause of that the market trend changes with a delay.

Similarly, during a DOWN trend, investors interpret

the positive news with too much pessimism and be-

cause of that the market trend changes again, albeit

with a delay.

Unfortunately, we haven’t got enough data to pos-

tulate some new laws based on our investigations. In

such a way, we can investigate only changes of big

trends and they are rare. We haven’t got news in elec-

tronical form before 1999.

7 CONCLUSIONS

In this paper, we focused on the classiﬁcation of large

sets of textual news by Support Vector Machine. In

technical terms, the SVM worked relatively well with

succesful classiﬁcation rates of about 80%. It’s also

noteable that with 6 min during linear-parallel classi-

ﬁcation, the classiﬁcation time was relatively low. As

described above, the nonlinear approach should have

yielded a better result (Zanni and Zanghirati, 2003).

Since the runtime of 6 minutes using linear PGPDT

is a lot smaller than 2 hours on nonlinear GPDT and

they differ only in about 0,8% classiﬁcation quality,

the linear-parallel approach proved to be the best in

our experiments.

It is very interesting to note that PGPDT and

8 process-OpenMPI cut the runtime of the nonlin-

ear runs down from the extremely long runtime of

48hours to 0,04% of that - 2 hours. Also, paral-

lel solving greatly improved linear runtime as well.

Therefore we advise to make full use of the parallel

implementation, the differences in speed we discov-

ered were immense.

We found that SVM is a viable tool to classify

large sets of stock market messages. However, the

market processes are known to be chaotic (Peters,

1996). This means that any prediction and any fore-

cast mechanism is on principle questionable. We

could argue from our experimentation that SVMs are

a viable means of forecasting large movements in the

stock market. Also, we explained above the possi-

ble psychologic reasons of the delayed reaction of in-

vestors on changed news. But since our dataset is very

limited (our set of news in electronic form starts at the

end of 1999) and long-time stock market data is hard

to come by, no conclusive statement about the fore-

cast quality can be reached in a mathematically cor-

rect way. But for all that, we ﬁnd the news classiﬁ-

cation results and their correlation with the long-term

market trends interesting.

In the further work, we will try to support our

hypothesis by using more sophisticated classiﬁcation

methods. Also, we will look for correlation with other

market indices.

REFERENCES

Boser, B., Guyon, I., and Vapnik, V. (1992). A training algo-

rithm for optimal margin classiﬁers. In In: Haussler,

D. (Ed.): Proceedings of 5th Annual ACM Workshop

on COLT, pages 144-152. Pittsburgh, PA, ACM Press.

Braverman, E., Aizerman, M., and Rozonoer, L. (1964).

Theoretical foundations of the potential function

methodin pattern recognition learning. In Automation

and Remote Control 25, pp. 821 - 837.

Janetzko, R. (2008). Using Support Vector Machines for

Classiﬁcation of News. TU Chemnitz, (In German).

Joachims, T. (1998a). Making large-scale svm learning

practical. In LS8-Report, 24. Universitaet Dortmund.

Joachims, T. (1998b). Text categorization with sup-

port vector machines: Learning with many rele-

vant features. In http://www.cs.cornell.edu/ Peo-

ple/tj/publications/joachims98a.

Joachims, T. (2001). Learning to classify text using Support

Vector Machines. Kluwer Academic Publishers.

A CASE STUDY - Classification of Stock Exchange News by Support Vector Machines

335

Kroeber, K. (2010). Classiﬁcation of Stock Market Mes-

sages Using Nonlinear Support Vector Machines. TU

Chemnitz.

Kroha, P., Baeza-Yates, R., and Krellner, B. (2006).

Text mining of business news for forecasting. In

In: Proceedings of 17th International Conference

DEXA’2006, Workshop on Theory and Applications of

Knowledge Management TAKMA’2006, pp. 171-175.

IEEE Computer Society.

Kroha, P. and Reichel, T. (2007). Using grammars for text

classiﬁcation. In In: Cardoso, J., Cordeiro, J., Filipe,

J.(Eds.): Proceedings of the 9th International Confer-

ence on Enterprise Information Systems ICEIS’2007,

Volume Artiﬁcial Intelligence and Decision Support

Systems, pp. 259-264. INSTICC with ACM SIGMIS

and AAAI.

Kroha, P., Reichel, T., and Krellner, B. (2007). Text mining

for indication of changes in long-term market trends.

In In: Tochtermann, K., Maurer, H. (Eds.): Proceed-

ings of I-KNOW’07 7th International Conference on

Knowledge Management as part of TRIPLE-I 2007,

Journal of Universal Computer Science, pp. 424-431.

Peters, E. (1996). Chaos and Order in the Capital Markets.

John Wiley.

Seraﬁni, T., Zanghirati, G., and Zanni, L. (2005). Gradient

projection methods for large quadratic programs and

applications in training support vector machines. In

Optim. Meth. Soft., 20, pp. 353-378.

Vapnik, V. (1995). The Nature of Statistical Learnig Theory.

Springer.

Zanni, L., Seraﬁni, T., and Zanghirati, G. (2006). Parallel

software for training large scale support vector ma-

chines on multiprocessor systems. In Journal of Ma-

chine Learning Research, Volume 7.

Zanni, L. and Zanghirati, G. (2003). A parallel solver for

large quadratic programs in training support vector

machines. In Parallel Computing, 29, pp. 535 - 551.

ICSOFT 2010 - 5th International Conference on Software and Data Technologies

336