under consideration. The second variant uses a mixture of the previous idea and the relevance frequency rf, so as to also take into account the number of documents belonging to the category in which the term occurs.
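To make the two ingredients of this mixture concrete, the following sketch computes the relevance frequency rf of Lan et al. (2009) together with an idf factor restricted to documents outside the category under consideration. The idfec formulation shown here, as well as the function names and the example counts, are illustrative assumptions; the exact definitions of idfec and of the combined tf.idfec-based weight are those given in the preceding sections.

import math

def rf(a, c):
    # Relevance frequency (Lan et al., 2009): a is the number of documents
    # of the category that contain the term, c the number of documents
    # outside the category that contain it.
    return math.log2(2 + a / max(1, c))

def idfec(n_neg, c):
    # Illustrative "idf excluding category": only the c documents outside
    # the category that contain the term count against it, so a term that
    # is frequent inside the category is not penalized. This exact
    # formulation is an assumption, not the paper's definition.
    return math.log((n_neg + 1) / (c + 1))

# Example: a term with raw frequency 3, occurring in 12 of the category's
# documents and in 4 of the 100 documents outside the category.
w_rf = 3 * rf(12, 4)         # tf.rf weight
w_idfec = 3 * idfec(100, 4)  # tf.idfec weight (illustrative)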
We performed extensive experimental studies on two datasets, namely the Reuters corpus (with either 10 or 52 categories) and 20 Newsgroups, and with three different classification methods, namely SVM classifiers with linear and RBF kernel functions and Random Forest. The results show that the tf.idfec-based method, which combines idfec and rf, generally achieves top results on all datasets and with all classifiers. Through statistical significance tests, we showed that the proposed scheme always ranks among the most effective ones and is never significantly worse than the other methods. The results highlight a close competition between our tf.idfec-based scheme and tf.rf; in particular, the best results obtained with the different datasets and algorithms, while varying the degree of feature selection, are very similar, with some differences: tf.rf seems to be more stable when the number of features is high, while our tf.idfec-based scheme gives excellent results with few features and shows some decay (less than 4%) as the number of features increases.
As future work, we plan to apply this idea to larger datasets and to hierarchical text corpora, using a variation of idfec able to take the taxonomy of categories into account. We also plan to test the use of this scheme for feature selection, in addition to term weighting. Moreover, we are going to investigate the effectiveness of this variant of tf.idf in other fields where weighting schemes can be employed with possible effectiveness improvements, such as sentiment analysis and opinion mining.
REFERENCES
Bloehdorn, S. and Hotho, A. (2006). Boosting for text
classification with semantic features. In Mobasher,
B., Nasraoui, O., Liu, B., and Masand, B., editors,
Advances in Web Mining and Web Usage Analysis,
volume 3932 of Lecture Notes in Computer Science,
pages 149–166. Springer Berlin Heidelberg.
Breiman, L. (2001). Random forests. Machine Learning,
45(1):5–32.
Carmel, D., Mejer, A., Pinter, Y., and Szpektor, I. (2014).
Improving term weighting for community question
answering search using syntactic analysis. In Pro-
ceedings of the 23rd ACM International Conference
on Conference on Information and Knowledge Man-
agement, CIKM ’14, pages 351–360, New York, NY,
USA. ACM.
Debole, F. and Sebastiani, F. (2003). Supervised term weighting for automated text categorization. In Proceedings of SAC-03, 18th ACM Symposium on Applied Computing, pages 784–788. ACM Press.
Deisy, C., Gowri, M., Baskar, S., Kalaiarasi, S., and Ramraj,
N. (2010). A novel term weighting scheme MIDF for
text categorization. Journal of Engineering Science
and Technology, 5(1):94–107.
Deng, Z.-H., Luo, K.-H., and Yu, H.-L. (2014). A study of
supervised term weighting scheme for sentiment anal-
ysis. Expert Systems with Applications, 41(7):3506–
3513.
Deng, Z.-H., Tang, S.-W., Yang, D.-Q., Zhang, M., Li, L.-Y., and
Xie, K.-Q. (2004). A comparative study on feature
weight in text categorization. In Advanced Web Tech-
nologies and Applications, pages 588–597. Springer.
Dietterich, T. G. (1998). Approximate statistical tests
for comparing supervised classification learning algo-
rithms. Neural Comput., 10(7):1895–1923.
Domeniconi, G., Moro, G., Pasolini, R., and Sartori, C.
(2014). Cross-domain text classification through it-
erative refining of target categories representations. In
Proceedings of the 6th International Conference on
Knowledge Discovery and Information Retrieval.
Galavotti, L., Sebastiani, F., and Simi, M. (2000). Experi-
ments on the use of feature selection and negative ev-
idence in automated text categorization. In Research
and Advanced Technology for Digital Libraries, pages
59–68. Springer.
Hassan, S. and Banea, C. (2006). Random-walk term weighting for improved text classification. In Proceedings of TextGraphs: the 2nd Workshop on Graph-Based Methods for Natural Language Processing, pages 53–60. ACL.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML'98), pages 137–142. Springer.
Lan, M., Sung, S.-Y., Low, H.-B., and Tan, C.-L. (2005). A
comparative study on term weighting schemes for text
categorization. In Neural Networks, 2005. IJCNN’05.
Proceedings. 2005 IEEE International Joint Confer-
ence on, volume 1, pages 546–551. IEEE.
Lan, M., Tan, C. L., Su, J., and Lu, Y. (2009). Supervised
and traditional term weighting methods for automatic
text categorization. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 31(4):721–735.
Largeron, C., Moulin, C., and Géry, M. (2011). Entropy
based feature selection for text categorization. In Pro-
ceedings of the 2011 ACM Symposium on Applied
Computing, SAC ’11, pages 924–928, New York, NY,
USA. ACM.
Leopold, E. and Kindermann, J. (2002). Text categorization with support vector machines. How to represent texts in input space? Mach. Learn., 46(1-3):423–444.
Lewis, D. D. (1995). Evaluating and optimizing au-
tonomous text classification systems. In Proceedings
of the 18th annual international ACM SIGIR confer-
ence on Research and development in information re-
trieval, SIGIR ’95, pages 246–254, New York, NY,
USA. ACM.
Liu, Y., Loh, H. T., and Sun, A. (2009). Imbalanced text
classification: A term weighting approach. Expert
Syst. Appl., 36(1):690–701.