Experiences with the LVQ Algorithm in Multilabel Text Categorization

A. Montejo-Ráez, M. T. Martín-Valdivia and L. A. Ureña-López
Jaén University, Jaén E-23071 Spain - Departamento de Informática
Abstract. Text Categorization is an important information processing task. This paper presents a neural approach to a text classifier based on the Learning Vector Quantization (LVQ) algorithm. We focus on multilabel multiclass text categorization. Experiments were carried out using the High Energy Physics (HEP) text collection, a highly unbalanced collection. The results obtained are very promising and show that our neural approach based on the LVQ algorithm behaves robustly over different parameters.
1 Introduction
Text Categorization (TC) is an important task for many Natural Language Processing (NLP) applications. Given a set of documents and a set of categories, the goal of a categorization system is to assign to each document the set (possibly empty) of categories that the document belongs to. The simplest case involves only one class, and the categorization problem becomes a decision problem or binary categorization problem (given a document, the goal is to determine whether the document is 'relevant' or 'not relevant').
On the other hand, the single-label categorization problem consists of assigning exactly one category to each document, while the multi-label categorization problem assigns from 0 to m categories to the same document. For a deeper and more exhaustive discussion of the different text categorization problems, see [1].
This work studies the multi-label categorization problem using the Learning Vector Quantization (LVQ) algorithm. The LVQ algorithm is a competitive neural learning algorithm that allows assigning a category (from a set of categories) to a document (single-label problem). In order to tackle the multi-label categorization problem, we have used the LVQ algorithm as a binary classifier integrated into the TECAT toolkit [2]. TECAT stands for TExt CATegorization. It is a tool for creating multi-label automatic text classifiers: with TECAT one can experiment with different collections and classifiers in order to build a multi-label automatic text classifier, and it implements the Adaptive Selection of Base Classifiers (ASBC) approach to the problem.
The paper is organized as follows. First, the HEP collection is introduced. Then, we briefly describe the TECAT toolkit as a tool for solving the multi-label automatic text classification problem. Next, we describe the LVQ algorithm and its integration into TECAT in order to carry out our experiments. After this, we present our evaluation environment and the results obtained. Finally, we discuss our conclusions and future work.
2 The HEP Collection
The HEP corpus, freely available for academic purposes from http://sinai.ujaen.es/wiki/index.php/HepCorpus, is a collection of papers related to High Energy Physics, manually indexed with DESY labels. These documents have been compiled by Montejo-Ráez and Jens Vigen from the Document Server (http://cds.cern.ch) of CERN, the European Laboratory for Particle Physics located in Geneva (Switzerland), and have motivated intensive study of text categorization systems in recent years [3], [4], [5].
Table 1. The ten most frequent main keywords in the hep-ex partition.
No. docs. Keyword
1898 (67%) electron positron
1739 (62%) experimental results
1478 (52%) magnetic detector
1190 (42%) quark
1113 (39%) talk
715 (25%) Z0
676 (24%) anti-p p
551 (19%) neutrino
463 (16%) W
458 (16%) jet
In our experiments we have used the meta-data from the hep-ex partition of the HEP collection, composed of 2,839 abstracts related to experimental High Energy Physics that are indexed with 1,093 main keywords, with an average number of classes per document slightly above 11. This partition is highly unbalanced: only 84 classes are represented by more than 100 samples and only five classes by more than 1,000. The uneven use is particularly noticeable for the ten most frequent keywords: in Table 1 the left column shows the number of positive samples of a keyword and its percentage over the total of samples in the collection.
2.1 Evaluation Measures
The effectiveness of a classifier can be evaluated with several well-known measures [1]. The classical "Precision" and "Recall" from Information Retrieval are adapted to the case of Automatic Text Categorization. To that end, we must first determine the basis on which the system will be evaluated: categories or documents. Traditionally, a contingency table is generated for each category (Table 2), and the precision and recall for each category are then calculated following equations 1 and 2. In a multi-label scenario, however, an equivalent contingency table can also be computed where i represents a document instead of a class. For this evaluation to be useful, the average number of classes per document should be significant. This is the case for the hep-ex partition, where an average number of 15 classes are assigned to a single document.
Also, this latter basis of evaluation better serves the purpose of really testing the effectiveness of a labeling system in real environments, where a user expects a list of classes for a given item. This difference applies only to macro-averaged values, because micro-averaged measurements remain the same under both evaluation strategies.
Table 2. Contingency table for category i.

                    YES is correct   NO is correct
  YES is assigned        a_i              b_i
  NO is assigned         c_i              d_i

P_i = \frac{a_i}{a_i + b_i}    (1)

R_i = \frac{a_i}{a_i + c_i}    (2)

On the other hand, precision and recall can be combined using the F_1 measure:

F_1(R, P) = \frac{2PR}{P + R}    (3)
As we have pointed out, in order to measure the average performance of a classifier globally, three measures can be used: micro-averaged precision P^{\mu}, macro-averaged precision on a document basis P^{macro_d}, and macro-averaged precision on a category basis P^{macro_c}:

P^{\mu} = \frac{\sum_{i=1}^{K} a_i}{\sum_{i=1}^{K} (a_i + b_i)}    (4)

P^{macro} = \frac{\sum_{i=1}^{K} P_i}{K}    (5)

where K is the number of categories or the number of documents, depending on the basis used.
Recall and F_1 measures are computed in a similar way. In our experiments we have used these measures in order to assess the effectiveness of the studied system.
3 The Adaptive Selection of Base Classifiers in TECAT
We developed the TECAT tool for automatic text classification of HEP documents
stored at the CERN digital library [6]. It implements the Adaptive Selection of Base Classifiers (ASBC) algorithm as its classification strategy (shown in Figure 1), which trains and selects binary classifiers for each class independently. In this algorithm any type of binary classifier can be used, even a set of heterogeneous binary classifiers such as SVM [7], Rocchio [8] and others at the same time, but (a) it allows adjusting the binary classifier for a given class by a balance factor, and (b) it gives the possibility of choosing the best of a given set of binary classifiers.

Input:
    a set of training documents D_t
    a set of validation documents D_v
    a threshold α on the evaluation measure
    a set of possible labels (classes) L
    a set of candidate binary classifiers C
Output:
    a set C' = {c_1, ..., c_k, ..., c_|L|} of trained binary classifiers
Pseudo-code:
    C' ← ∅
    for-each l_i in L do
        T ← ∅
        for-each c_j in C do
            train-classifier(c_j, l_i, D_t)
            T ← T ∪ {c_j}
        end-for-each
        c_best ← best-classifier(T, D_v)
        if evaluate-classifier(c_best) > α
            C' ← C' ∪ {c_best}
        end-if
    end-for-each

Fig. 1. The one-against-all learning algorithm with classifier filtering.
The algorithm introduces the α parameter, a threshold on the minimum performance allowed for a binary classifier during the validation phase of the learning process. If the performance of a certain classifier is below the value α, meaning that the classifier performs badly, both the classifier and the class are discarded. The effect is similar to that of SCutFBR [9], i.e. we never attempt to return a positive answer for rare classes. This algorithm has been found to outperform well-known approaches like AdaBoost [10] when applied to HEP data [3].
4 LVQ Integration into TECAT
The Learning Vector Quantization (LVQ) algorithm [11] has been successfully used in several applications such as pattern recognition, speech analysis, etc. This work proposes the use of the LVQ algorithm to tackle the multi-label text categorization problem. However, LVQ is used only as a binary classifier integrated in the TECAT system.
4.1 The LVQ Algorithm
The LVQ algorithm is a classification method which allows the definition of a group of categories on the space of input data by reinforcement, either positive (reward) or negative (punishment). LVQ uses supervised learning to define class regions in the input data space. The weight vectors associated with each output unit are known as codebook vectors, and each class of the input space is represented by its own set of codebook vectors, defined according to the specific task.
The basic LVQ algorithm is quite simple. It starts with a set of input vectors x_i and weight vectors w_k which represent the classes to learn. In each iteration, an input vector x_i is selected and the vectors w_k are updated so that they fit x_i better. The LVQ algorithm works as follows.
In each iteration, the algorithm selects an input vector x_i and compares it with every weight vector w_k using the Euclidean distance \|x_i - w_k\|, so that the winner is the codebook vector w_c closest to x_i in the input space under this distance function. The determination of c is achieved by the following decision process:

\|x_i - w_c\| = \min_k \|x_i - w_k\|    (6)

i.e.,

c = \arg\min_k \|x_i - w_k\|    (7)
The LVQ algorithm is a competitive network, and thus, for each training vector, output units compete among themselves in order to find the winner according to some metric. The LVQ algorithm uses the Euclidean distance to find the winner unit. Only the winner unit (i.e., the output unit with the smallest Euclidean distance to the input vector) modifies its weights, using the LVQ learning rule. Let x_i(t) be an input vector at time t, and w_k(t) the weight vector for class k at time t. The following equations define the basic learning process of the LVQ algorithm:

w_c(t + 1) = w_c(t) + \beta(t)[x_i(t) - w_c(t)]   if x_i and w_c belong to the same class

w_c(t + 1) = w_c(t) - \beta(t)[x_i(t) - w_c(t)]   if x_i and w_c belong to different classes    (8)

w_k(t + 1) = w_k(t)   if k ≠ c

where β(t) is the learning rate, which decreases with the number of training iterations (0 < β(t) ≪ 1). It is recommended that β(t) be rather small initially, say smaller than 0.3, and that it decrease to a given threshold ν very close to 0 [11]. In our experiments, we have initialized β(t) to 0.1.
4.2 Using LVQ with TECAT
The LVQ algorithm has been integrated into TECAT as a binary classifier. For this, LVQ has been set up to work with only two possible classes: positive or negative. Thus, we can pass as parameters to the algorithm the number of codebook vectors that we want to associate with each of these two categories. TECAT tries to train an LVQ classifier for each class, discarding both class and classifier if a minimal performance value is not reached on a validation subset (the measure used is F_1 and the threshold α on it is set to 0.1). Therefore, instead of having a set of codebook vectors for each class, we have a binary LVQ version of the classifier for each class.
4.3 Adjusting the Codebook Number per Class
The HEP corpus is highly unbalanced; therefore, when learning on a binary space, the number of negative samples clearly surpasses that of the positive ones. Following this reasoning, we could foresee a benefit in classifier adaptability if the number of codebook vectors associated with each of the two sides is set according to the imbalance that the class reflects. For this reason, we have carried out different experiments to adjust the number of codebook vectors for every single class.
In order to study the effects of the number of codebooks assigned to each class, we have performed several experiments with a varying number of codebook vectors per class. On the one hand, we have used the same number of codebooks per class. In this case we have carried out three different experiments with ncb0 = ncb1 = {1, 2, 5}, where ncb0 is the number of codebooks assigned to class 0 and ncb1 is the number of codebooks assigned to class 1. These two classes represent the binary decision to take when assigning each class to a document.
On the other hand, we have assigned a different number of codebooks per class depending on the number of negative and positive training samples. Thus, we have assigned 1, 2 and 5 codebooks to class 1, i.e., ncb1 = {1, 2, 5}. However, to calculate the number of codebooks assigned to class 0, we have used the following formula:

ncb0 = ncb1 + \log_2\left(\frac{n_{neg}}{n_{pos}}\right)    (9)

where n_pos and n_neg are the numbers of positive and negative training samples, respectively. The fraction is inverted if the number of positive samples is larger than the number of negative ones (which occurs only for one class in the collection used). For example, if a class had 35 positive samples and 1,671 negative ones, and the value of ncb1 were 2, then 6 codebook vectors would be trained to cover the negative side of the decision space.
5 Results
Depending on the number of codebook vectors assigned to each class, we have carried out six different experiments: three with the same number of codebooks (ncbv1, ncbv2 and ncbv5) and three with a different number of codebooks (normcbv1, normcbv2 and normcbv5). Results for the 10 most frequent classes are shown in Table 3 and Figure 2. Tables 4, 5 and 6 show macro-averaged and micro-averaged results for the different experiments.
Table 3. Results summary for averaged measures on 10 most frequent classes.
Precision Recall F1 Experiment
0.744244 0.570907 0.624045 ncbv1
0.757046 0.613313 0.662224 ncbv2
0.771196 0.617652 0.674908 ncbv5
0.768844 0.563419 0.631547 normcbv1
0.780134 0.611680 0.669962 normcbv2
0.773607 0.614498 0.674848 normcbv5
Table 4. Results summary on macro-averaged values in a document basis.
Precision Recall F1 Experiment
0.627956 0.404897 0.455048 ncbv1
0.627246 0.450413 0.490741 ncbv2
0.639530 0.447763 0.495636 ncbv5
0.652167 0.378170 0.448980 normcbv1
0.655089 0.406707 0.471558 normcbv2
0.645321 0.397321 0.461638 normcbv5
Fig. 2. F1 measurements for different configurations on the 10 most frequent classes.
The main remark is that using an uneven number of codebook vectors on the positive and negative sides depending on class imbalance does not, in general, yield better results. This may be due to a lack of positive samples in most classes. The decision on whether to use a different number of positive and negative codebook vectors should be restricted to scenarios where classes are well represented. This reasoning can be checked in Table 3: there, for the most frequent classes, balancing the codebook vectors improves the obtained results.
Table 5. Results summary on macro-averaged values in a class basis.
Precision Recall F1 Experiment
0.500383 0.320962 0.362223 ncbv1
0.493239 0.327211 0.369442 ncbv2
0.627357 0.386476 0.458187 ncbv5
0.514602 0.274930 0.333245 normcbv1
0.517373 0.294911 0.352776 normcbv2
0.487381 0.262504 0.322848 normcbv5
Table 6. Results summary on micro-averaged values.
Precision Recall F1 Experiment
0.656294 0.415546 0.508883 ncbv1
0.656267 0.438942 0.526042 ncbv2
0.716130 0.481063 0.575519 ncbv5
0.710973 0.386372 0.500663 normcbv1
0.701892 0.415265 0.521809 normcbv2
0.710198 0.409406 0.519397 normcbv5
Nevertheless, a binary version of the LVQ algorithm stands as a valid solution to this multi-label problem. As can be seen, no major variations in the evaluation measures are observed across the different configurations, so the algorithm is robust in this sense. Regarding the number of codebook vectors, a higher number tends to improve overall classifier performance: indeed, the ncbv5 experiment achieves the highest measured F1 under all averaging strategies.
6 Conclusions
A neural algorithm for multi-label text categorization has been studied. This algorithm is based on the LVQ learning method, integrated into the ASBC approach. Several experiments have been carried out on a highly unbalanced collection: the hep-ex partition of the HEP corpus. The results obtained show that, despite the complexity of the collection, the algorithm remains robust across different configuration parameters.
As future work we plan to apply this algorithm to a collection that allows better comparison within the text categorization domain, specifically Reuters-21578. For this collection, results of applying LVQ with one codebook vector per class are available [12], which reported 0.61 macro-averaged precision and 0.73 micro-averaged precision. Also, since the ASBC algorithm allows selecting among a set of possible classifiers, we will test our approach with a wide range of parameterizations of the LVQ algorithm on every class, letting the system decide which parameters are best for each class.
Acknowledgements
This work has been partially supported by a grant from the Spanish Government, project TIMOM (TIN2006-15265-C06-03), and a grant from the University of Jaén, project RFC/PP2006/Id 514.
References
1. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34 (2002) 1–47
2. Montejo-Ráez, A.: Automatic Text Categorization of Documents in the High Energy Physics Domain. PhD thesis, University of Granada (2006)
3. Montejo-Ráez, A., Ureña-López, L.: Binary classifiers versus AdaBoost for labeling of digital documents. Sociedad Española para el Procesamiento del Lenguaje Natural (2006) 319–326
4. Montejo-Ráez, A., Ureña-López, L.: Selection strategies for multi-label text categorization. Lecture Notes in Artificial Intelligence (2006) 585–592
5. Vassilevskaya, L.A.: An approach to automatic indexing of scientific publications in high energy physics for the database SPIRES-HEP. Master's thesis, Fachhochschule Potsdam, Institut für Information und Dokumentation (2002)
6. Montejo-Ráez, A., Steinberger, R., Ureña-López, L.A.: Adaptive selection of base classifiers in one-against-all learning for large multi-labeled collections. In Vicedo, J.L., et al., eds.: Advances in Natural Language Processing: 4th International Conference, EsTAL 2004. Number 3230 in Lecture Notes in Artificial Intelligence, Springer (2004) 1–12
7. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In Nédellec, C., Rouveirol, C., eds.: Proceedings of ECML-98, 10th European Conference on Machine Learning. Number 1398, Chemnitz, DE, Springer Verlag, Heidelberg, DE (1998) 137–142
8. Lewis, D.D., Schapire, R.E., Callan, J.P., Papka, R.: Training algorithms for linear text classifiers. In Frei, H.P., Harman, D., Schäuble, P., Wilkinson, R., eds.: Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, Zürich, CH, ACM Press, New York, US (1996) 298–306
9. Yang, Y.: A study on thresholding strategies for text categorization. In Croft, W.B., Harper, D.J., Kraft, D.H., Zobel, J., eds.: Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, New Orleans, US, ACM Press, New York, US (2001) 137–145
10. Schapire, R.E., Singer, Y.: BoosTexter: A boosting-based system for text categorization. Machine Learning 39 (2000) 135–168
11. Kohonen, T.: Self-Organization and Associative Memory. 2nd edn. Springer-Verlag (1995)
12. Martín-Valdivia, M., García-Vega, M., García-Cumbreras, M., Ureña-López, L.: Text categorization using the learning vector quantization algorithm. In: Proceedings of Intelligent Information Systems. New Trends in Intelligent Information Processing and Web Mining (IIS:IIPWM-04), Zakopane, Poland, Springer-Verlag (2004)