Temporal-based Feature Selection and Transfer Learning for
Text Categorization
Fumiyo Fukumoto and Yoshimi Suzuki
Graduate Faculty of Interdisciplinary Research, University of Yamanashi, Kofu, Japan
Keywords:
Feature Selection, Latent Dirichlet Allocation, Temporal-based Features, Text Categorization, Timeline
Adaptation, Transfer Learning.
Abstract:
This paper addresses the text categorization problem in which the training data may derive from a different time period than the test data. We present a method for text categorization that minimizes the impact of temporal effects. Like much previous work on text categorization, we used feature selection. We selected two types of informative terms according to corpus statistics. One is temporal independent terms, which are salient across the full temporal range of the training documents. The other is temporal dependent terms, which are important for a specific time period. For the training documents represented by independent/dependent terms, we applied boosting-based transfer learning to learn an accurate model for timeline adaptation. The results using Japanese data showed that the method was comparable to the current state-of-the-art biased-SVM method: the macro-averaged F-score obtained by our method was 0.688, and that of biased-SVM was 0.671. Moreover, we found that the method is effective, especially when the creation time period of the test data differs greatly from that of the training data.
1 INTRODUCTION
Text categorization supports many tasks such as automatic topic tagging, building topic directories, spam filtering, creating digital libraries, sentiment analysis of user reviews, information retrieval, and even helping users to interact with search engines (Mourao et al., 2008). A growing number of machine learning (ML) techniques have been applied to the text categorization task (Xue et al., 2008; Gopal and Yang, 2010). For reasons of both efficiency and accuracy, feature selection has often been used since the early 1990s when applying machine learning methods to text categorization (Lewis and Ringuette, 1994; Yang and Pedersen, 1997; Dumais and Chen, 2000). Each document is represented using a vector of selected features/terms (Yang and Pedersen, 1997; Hassan et al., 2007). Then, the documents with category labels are used to train classifiers. Once category models are trained, each test document is classified using these models. A basic assumption in the categorization task is that the distributions of terms in the training and test documents are identical. When the assumption does not hold, classification accuracy deteriorates. However, it is often the case that the term distribution of the training data differs from that of the test data when the training data derive from a different time period than the test data. For instance, the term "Alcindo" frequently appeared in documents tagged with the "Sports" category in 1994. This is reasonable because Alcindo is a Brazilian soccer player and he was one of the most loved players in 1994. However, the term no longer occurred frequently in the Sports category after he retired in 1997. This observation shows that a term that is informative in the training data is not necessarily informative in the test data when the training data derive from a different time period than the test data; in the above example, the term "Alcindo" is informative in the training data with the Sports category collected in 1994, but not informative in test data from other years, e.g., 2005, that should be classified into the Sports category. Moreover, manual annotation of new data is very expensive and time-consuming. A methodology for the accurate classification of new test data that makes maximum use of tagged old data is therefore needed in both feature selection and learning techniques.
In this paper, we present a method for text categorization that minimizes the impact of temporal effects. We selected two types of salient terms by using a simple feature selection technique, the χ² statistic. One is temporal independent terms, which are salient
across the full temporal range of the training documents, such as "baseball" and "tennis" in the Sports category. The other is temporal dependent terms, which are salient for a specific time period, such as "Alcindo" in the Sports category in 1994, mentioned in the above example. Hereafter, we call this temporal-based feature selection (TbFS). As a result of TbFS, each document is represented by a vector of the selected independent/dependent terms, and classifiers are trained. We applied boosting-based transfer learning, TrAdaBoost (Dai et al., 2007), in order to minimize the impact of temporal effects. Hereafter, we call this temporal-based transfer learning (TbTL). The idea is to use TrAdaBoost to decrease the weights of training instances that are very different from the test data.
The rest of the paper is organized as follows. The next section gives an overview of related work. Section 3 presents our approach, in particular how we adjust for temporal differences between training and test documents. Finally, we report the experiments together with a discussion of the evaluation.
2 RELATED WORK
The analysis of temporal aspects is a practical problem, as is the processing of large-scale heterogeneous data, since the World-Wide Web (WWW) is used by many kinds of people. It has been widely studied in many text processing tasks. One line of work is concept or topic drift, which deals with temporal effects (Kleinberg, 2002; Lazarescu et al., 2004; Folino et al., 2007). The earliest known approach is the work of Klinkenberg and Joachims (2000). They presented a method to handle concept changes with SVMs. They used ξα-estimates to select the window size so that the estimated generalization error on new examples is minimized. The results, tested on TREC data, show that the algorithm achieves a low error rate and selects appropriate window sizes. Wang et al. developed the continuous time dynamic topic model (cDTM) (Wang et al., 2008). The cDTM is an extension of the discrete dynamic topic model (dDTM). The dDTM is a powerful model; however, the choice of discretization affects the memory requirements and computational complexity of posterior inference. The cDTM replaces the discrete state space model with its continuous generalization, Brownian motion. He et al. proposed a method to find bursts, periods of elevated occurrence of events, as a dynamic phenomenon instead of focusing on arrival rates (He and Parker, 2010). They used the Moving Average Convergence/Divergence (MACD) histogram, which was used in technical stock market analysis (Murphy, 1999), to detect bursts. They tested the method using MeSH terms and reported that the model works well for tracking topic bursts. He et al.'s burst model can be regarded as identifying salient features/terms for a specific time period, although their method cannot extract such terms automatically, i.e., it is necessary to give these terms in advance as input to their model.
Another line of work is domain adaptation. The goal here is to develop learning algorithms that can be easily ported from one domain to another, e.g., from newswire to biomedical documents (III, 2007). Domain adaptation is particularly interesting in Natural Language Processing (NLP) because it is often the case that we have a collection of labeled data in one domain but truly desire a model that works well for another domain. Many studies have addressed domain adaptation in NLP tasks such as part-of-speech tagging (Siao and Guo, 2013), named-entity recognition (III, 2007), and sentiment classification (Glorot et al., 2011). One approach to domain adaptation is transfer learning. Transfer learning is a learning technique that retains and applies the knowledge learned in one or more tasks to efficiently develop an effective hypothesis for a new task. The earliest discussion was held by the ML community in a NIPS-95 workshop (http://socrates.acadiau.ca/courses/comp/dsilver/NIPS95_LTL/transfer.workshop.1995.html), and more recently,
transfer learning techniques have been successfully
applied in many applications. Blitzer et al. proposed a method for sentiment classification using structural correspondence learning, which makes use of unlabeled data from the target domain to extract relevant features that may reduce the difference between the domains (Blitzer et al., 2006). Several authors have attempted to learn classifiers across domains using transfer learning in the text classification task (Raina et al., 2006; Dai et al., 2007; Sparinnapakorn and Kubat, 2007). Raina et al. proposed a transfer learning algorithm that constructs an informative Bayesian prior for a given text classification task (Raina et al., 2006). The prior encodes useful domain knowledge by capturing underlying dependencies between the parameters. They reported a 20 to 40% test error reduction over a commonly used prior in the binary text classification task. All of the approaches mentioned above aim at utilizing a small amount of newly labeled data to leverage the old data to construct a high-quality classification model for the new data. However, temporal effects are not explicitly incorporated into their models.
To our knowledge, there have been only a few
previous works on temporal-based text categorization
(Kerner et al., 2008; Song et al., 2014). Mourao et al. investigated the impact of the temporal evolution of document collections based on three factors: (i) the class distribution, (ii) the term distribution, and (iii) the class similarity. They reported that these factors have a great influence on the performance of classifiers on the ACM-DL and Medline document collections, which span more than 20 years (Mourao et al., 2008). Salles et al. presented an approach to classify documents in scenarios where the method uses information about both the past and the future, and this information may change over time (Salles et al., 2010). They addressed the problem of which instances to select by approximating the Temporal Weighting Function (TWF) using a mixture of two Gaussians. They applied the TWF to every training document. However, it is often the case that terms that are informative for a specific time period and terms that are informative across the full temporal range of training documents are both included in the training data; this affects the overall performance of text categorization because these terms are weighted equally in their approach. Moreover, their method needs tagged training data across the full temporal range of the training documents to create the TWF.
There are three novel aspects to our method. Firstly, we propose a method for text categorization that minimizes the impact of temporal effects in both feature selection and learning techniques. Secondly, from the perspective of manual data annotation, we propose a temporal-based classification method that uses only a limited number of labeled training documents. Finally, from the perspective of robustness, the method is automated and can be applied easily to a new domain or to different languages, given sufficient unlabeled documents.
3 SYSTEM DESIGN
The method consists of three steps: (1) Collection
of documents by Latent Dirichlet Allocation (LDA),
(2) Temporal-based feature selection (TbFS), and (3)
Document categorization by temporal-based transfer
learning (TbTL).
3.1 Collection of Documents by LDA
The selection of temporal independent/dependent terms is done using documents with categories. However, manual annotation of categories is very expensive and time-consuming. Therefore, we used a topic model and classified unlabeled documents into categories. Topic models such as probabilistic latent semantic indexing (Hofmann, 1999) and LDA (Blei et al., 2003) are based on the idea that documents are mixtures of topics, where each topic is captured by a distribution over words. The topic probabilities provide an explicit low-dimensional representation of a document. They have been successfully used in many tasks such as text modeling and collaborative filtering (Li et al., 2013). We classified documents into categories using LDA. The generative process for LDA can be described as follows:
1. For each topic $k = 1, \ldots, K$, generate $\phi_k$, a multinomial distribution over terms specific to topic k, from a Dirichlet distribution with parameter β;

2. For each document $d = 1, \ldots, D$, generate $\theta_d$, a multinomial distribution over topics specific to document d, from a Dirichlet distribution with parameter α;

3. For each term $n = 1, \ldots, N_d$ in document d:

(a) generate a topic $z_{dn}$ for the n-th term in document d from the multinomial distribution $\theta_d$;

(b) generate a term $w_{dn}$, the term associated with the n-th term in document d, from the multinomial $\phi_{z_{dn}}$.
Like much previous work on LDA, we used Gibbs sampling to estimate φ and θ. The sampling probability for topic $z_i$ in document d is given by:

$$P(z_i \mid \mathbf{z}_{\setminus i}, W) = \frac{(n^v_{\setminus i,j} + \beta)(n^d_{\setminus i,j} + \alpha)}{(n^\cdot_{\setminus i,j} + W\beta)(n^d_{\setminus i,\cdot} + T\alpha)}. \tag{1}$$

$\mathbf{z}_{\setminus i}$ refers to a topic set Z that does not include the current assignment $z_i$. $n^v_{\setminus i,j}$ is the frequency of term v in topic j, not including the current assignment $z_i$, and $n^\cdot_{\setminus i,j}$ indicates a summation over that dimension. W refers to a set of documents, and T denotes the total number of unique topics. After a sufficient number of sampling iterations, the approximated posterior can be used to estimate φ and θ by examining the frequencies of term assignments to topics and topic occurrences in documents. The approximated probability of topic k in document d, $\hat{\theta}^k_d$, and the assignment of term w to topic k, $\hat{\phi}^w_k$, are given by:

$$\hat{\theta}^k_d = \frac{N_{dk} + \alpha}{N_d + \alpha K}. \tag{2}$$

$$\hat{\phi}^w_k = \frac{N_{kw} + \beta}{N_k + \beta V}. \tag{3}$$
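As a concrete illustration of the estimation above, the following is a minimal sketch, not the authors' implementation, of a collapsed Gibbs sampler in Python/NumPy. It samples topics with the unnormalized form of Eq. (1), assuming the standard convention in which the first denominator smooths over the vocabulary size V, and returns the estimates of Eqs. (2) and (3); the function name and parameter defaults are illustrative assumptions.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=500, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of term-id lists."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))              # topic counts per document
    n_kw = np.zeros((K, V))              # term counts per topic
    n_k = np.zeros(K)                    # total term count per topic
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random init
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # remove the current assignment z_i from the counts
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Eq. (1), up to the document-length normalizer
                p = (n_kw[:, w] + beta) / (n_k + beta * V) * (n_dk[d] + alpha)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # Eq. (2): theta_hat, and Eq. (3): phi_hat
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + alpha * K)
    phi = (n_kw + beta) / (n_k[:, None] + beta * V)
    return theta, phi
```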
For each year, we applied LDA to a set of doc-
uments where a set consists of a small number of
labeled documents and a large number of unlabeled documents. We need to estimate two parameters for the results obtained by LDA: one is the number of topics/classes k, and the other is the number of documents d for each topic/class. We note that the result can be regarded as a clustering result: each element of a cluster is a document assigned to a category or a document without category information. We estimated the numbers of topics and documents using the Entropy measure given by:
$$E = -\frac{1}{\log k} \sum_{j} \frac{N_j}{N} \sum_{i} P(A_i, C_j) \log P(A_i, C_j). \tag{4}$$
k refers to the number of clusters. $P(A_i, C_j)$ is the probability that an element of cluster $C_j$ is assigned to the correct class $A_i$. N denotes the total number of elements, and $N_j$ the total number of elements assigned to cluster $C_j$. The value of E ranges from 0 to 1, and a smaller value of E indicates a better result. We chose the parameters k and d whose value of E is smallest. For each cluster, we count the number of documents belonging to each category, and assign the category with the maximum count to each document in the cluster. If two or more categories tie for the maximum count, we assign all of these categories to each document in the cluster.
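One possible reading of Eq. (4) in code: the sketch below computes the entropy of a clustering in Python, assuming $P(A_i, C_j)$ is estimated as the proportion of class $A_i$ among the labeled members of cluster $C_j$; the function name and this estimation choice are our assumptions, not taken from the paper.

```python
import numpy as np

def clustering_entropy(assignments, labels, k):
    """Entropy of a clustering (Eq. 4): ranges over [0, 1], smaller is better.

    assignments[i] is the cluster id (0..k-1) of element i; labels[i] is its
    class id.  Elements without a known category are excluded beforehand.
    """
    N = len(assignments)
    E = 0.0
    for j in range(k):
        members = [labels[i] for i in range(N) if assignments[i] == j]
        if not members:
            continue
        N_j = len(members)
        counts = np.bincount(members)
        p = counts[counts > 0] / N_j     # class proportions within cluster j
        E += (N_j / N) * -(p * np.log(p)).sum()
    return E / np.log(k)
```

In the parameter estimation described above, one would evaluate this measure for each candidate (k, d) pair and keep the pair with the smallest E.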
3.2 Temporal-based Feature Selection
The second step is to select a set of independent/dependent terms from the training data obtained in the first step, the collection of documents by LDA. The selection is based on the use of a feature selection technique. We tested different feature selection techniques: χ² statistics, mutual information, and information gain (Yang and Pedersen, 1997; Forman, 2003). In this paper, we report only the χ² statistic, which optimized the global F-score in classification. χ² is given by:
$$\chi^2(t, C) = \frac{n \times (ad - bc)^2}{(a+c) \times (b+d) \times (a+b) \times (c+d)}. \tag{5}$$
Using the two-way contingency table of a term t and a category C: a is the number of documents in C containing the term t, b is the number of documents in other classes (not C) containing t, c is the number of documents in C not containing t, and d is the number of documents in other classes not containing t. n is the total number of documents.
We applied the χ² statistic in two ways. The first is to extract independent terms that are salient across the full temporal range of the training documents. For each category $C_i$ ($1 \le i \le s$), where s is the number of categories, we collected all documents with the same category across the full temporal range and created a set. The number of sets thus equals the number of categories, s. The second is to extract dependent terms that are salient for a specific time period. It is applied to sets of documents from different years within the same category. For a specific category $C_i$, we collected all documents within the same year and created a set. Thus, the number of sets equals the number of different years in the training documents. We selected terms whose χ² value is larger than a certain threshold and regarded them as independent/dependent terms.
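The selection step can be transcribed almost directly. The sketch below is a hedged illustration rather than the authors' code: it computes Eq. (5) from document-frequency counts and keeps the terms above the threshold. Here pos_docs would be one category across all years for independent terms, or one category within one year for dependent terms; the zero-denominator guard is our addition.

```python
from collections import Counter

def chi_square(a, b, c, d):
    """Eq. (5): chi-square statistic of a term t and a category C."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_terms(pos_docs, neg_docs, threshold):
    """Select terms whose chi-square value exceeds the threshold.

    pos_docs: documents (iterables of terms) of the target set;
    neg_docs: all remaining documents.
    """
    df_pos = Counter(t for doc in pos_docs for t in set(doc))
    df_neg = Counter(t for doc in neg_docs for t in set(doc))
    selected = []
    for t in set(df_pos) | set(df_neg):
        a, b = df_pos[t], df_neg[t]
        c, d = len(pos_docs) - a, len(neg_docs) - b
        if chi_square(a, b, c, d) > threshold:
            selected.append(t)
    return selected
```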
3.3 Document Categorization
So far, we have made maximum use of the tagged old data in feature selection. The final step is document categorization by TbTL. We trained the model and classified documents based on TrAdaBoost (Dai et al., 2007). TrAdaBoost extends AdaBoost (Freund and Schapire, 1997), which aims to boost the accuracy of a weak learner by adjusting the weights of training instances and learning a classifier accordingly. TrAdaBoost uses two types of training data. One is the so-called same-distribution training data, which has the same distribution as the test data; in general, the quantity of these data is limited. In contrast, the other, the so-called diff-distribution training data, whose distribution may differ from that of the test data, is abundant. TrAdaBoost aims at utilizing the diff-distribution training data to make up for the deficit of the small amount of same-distribution data, in order to construct a high-quality classification model for the test data. TrAdaBoost behaves the same as boosting on the same-distribution training data. The difference is that for diff-distribution training instances, when they are wrongly predicted, we assume that these instances do not contribute to accurate classification of the test data, and their weights are decreased in order to weaken their impact. Dai et al. applied TrAdaBoost to three text datasets, 20 Newsgroups, SRAA, and Reuters-21578, which have hierarchical structures. They split the data to generate diff-distribution and same-distribution sets that contain data in different subcategories. Our temporal-based transfer learning, TbTL, is based on TrAdaBoost. The differences between TbTL and the TrAdaBoost presented by Dai et al. are the initialization step and the output of the final hypothesis. The initialization step removes outliers. Outliers (training instances) are often included in the diff-distribution data itself, especially if
there is a large amount of diff-distribution data. As a result, they affect the overall performance of classification. We removed these outliers in the initialization step. The second difference is the output of the final hypothesis. We empirically tested the output of both the TrAdaBoost proposed by Dai et al. (Dai et al., 2007) and AdaBoost (Freund and Schapire, 1997), and chose AdaBoost's output, i.e., a hypothesis $h_f$ created by linearly combining the weak hypotheses $h_1, \ldots, h_N$ constructed at each round with weights $\beta_1, \ldots, \beta_N$, as it gave a better result than that obtained with TrAdaBoost's output, i.e., voting over the hypotheses $h_t$ from the $N/2$-th iteration to the $N$-th. The temporal-based transfer learning TbTL based on TrAdaBoost is illustrated in Figure 1.
$Tr_d$ denotes the diff-distribution training data, $Tr_d = \{(x^d_i, c(x^d_i))\}$, where $x^d_i \in X^d$ $(i = 1, \ldots, n)$ and $X^d$ refers to the diff-distribution instance space. Similarly, $Tr_s$ represents the same-distribution training data, $Tr_s = \{(x^s_i, c(x^s_i))\}$, where $x^s_i \in X^s$ $(i = 1, \ldots, m)$ and $X^s$ refers to the same-distribution instance space. n and m are the numbers of documents in $Tr_d$ and $Tr_s$, respectively. $c(x)$ returns the label of the input instance x. The combined training set $T = \{(x_i, c(x_i))\}$ is given by:

$$x_i = \begin{cases} x^d_i & i = 1, \ldots, n \\ x^s_i & i = n+1, \ldots, n+m \end{cases}$$
Steps 2, 3, and 4 of the Initialization in Figure 1 extract outliers, i.e., instances whose term distributions differ from those of the other diff-distribution training data. We removed these training instances from the original diff-distribution training data $Tr_d$, and used the remainder, $Tr_d^{new}$, as the input of TrAdaBoost. $n'$ in the TrAdaBoost step of Figure 1 refers to the number of remaining diff-distribution training documents.
Input: the diff-distribution data $Tr_d$, the same-distribution data $Tr_s$, and the maximum number of iterations N.

Output: $h_f(x) = \sum_{t=1}^{N} \beta_t h_t(x)$.

Initialization:

1. $w^1 = 1/n$.
2. Train a weak learner on the training set $Tr_d$, and create the weak hypothesis $h_0 : X \to Y$.
3. Classify $Tr_d$ by $h_0$.
4. Create a new diff-distribution training data set $Tr_d^{new}$ whose elements $x_i$ satisfy $|h_0(x_i) - c(x_i)| = 0$.
5. $w^1 = 1/(n' + m)$, where $n'$ refers to the number of documents in $Tr_d^{new}$.

TrAdaBoost: for $t = 1, \ldots, N$:

1. Set $p^t = w^t / \sum_{i=1}^{n'+m} w^t_i$.
2. Train a weak learner on the combined training set $Tr_d^{new} \cup Tr_s$ with the distribution $p^t$, and create the weak hypothesis $h_t : X \to Y$.
3. Calculate the error of $h_t$ on $Tr_s$:
$$\varepsilon_t = \frac{\sum_{i=n'+1}^{n'+m} w^t_i \,|h_t(x_i) - c(x_i)|}{\sum_{i=n'+1}^{n'+m} w^t_i}.$$
4. Set $\beta_t = \varepsilon_t/(1 - \varepsilon_t)$ and $\beta = 1/(1 + \sqrt{2 \ln n'/N})$.
5. Update the weight vector:
$$w^{t+1}_i = \begin{cases} w^t_i \,\beta^{|h_t(x_i) - c(x_i)|} & 1 \le i \le n' \\ w^t_i \,\beta_t^{-|h_t(x_i) - c(x_i)|} & n' + 1 \le i \le n' + m \end{cases}$$

Figure 1: Flow of the algorithm.
We used Support Vector Machines (SVM) as the weak learner. We represented each training and test document as a vector whose dimensions are the independent/dependent terms appearing in the documents and whose elements are term frequencies. We applied the algorithm shown in Figure 1. After several iterations, a learner model is created by linearly combining the weak learners, and each test document is classified using this model.
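For concreteness, here is a minimal sketch of the TbTL loop in Python with scikit-learn, under our own assumptions rather than as the authors' implementation: LinearSVC serves as the weak learner, labels are binary 0/1, the outlier-removing initialization and weight updates follow Figure 1, and the final 0.5-of-total-weight threshold is one plausible way to turn the linear combination of weak hypotheses into a label.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tbtl(Xd, yd, Xs, ys, n_iter=30):
    """Sketch of TbTL (Figure 1) for binary labels in {0, 1}."""
    # Initialization: train h_0 on Tr_d and drop misclassified outliers.
    h0 = LinearSVC().fit(Xd, yd)
    keep = h0.predict(Xd) == yd
    Xd, yd = Xd[keep], yd[keep]
    n, m = len(yd), len(ys)                  # n here plays the role of n'
    X, y = np.vstack([Xd, Xs]), np.concatenate([yd, ys])
    w = np.ones(n + m) / (n + m)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / n_iter))
    hypotheses, betas = [], []
    for _ in range(n_iter):
        p = w / w.sum()
        h = LinearSVC().fit(X, y, sample_weight=p * len(y))
        err = np.abs(h.predict(X) - y)       # 0/1 loss per instance
        eps = (w[n:] * err[n:]).sum() / w[n:].sum()
        eps = min(max(eps, 1e-10), 0.499)    # keep beta_t in (0, 1)
        beta_t = eps / (1.0 - eps)
        # Decrease weights of wrongly predicted diff-distribution instances,
        # increase weights of wrongly predicted same-distribution instances.
        w[:n] *= beta ** err[:n]
        w[n:] *= beta_t ** -err[n:]
        hypotheses.append(h); betas.append(beta_t)
    def predict(Xt):
        # Final hypothesis: weighted linear combination of weak hypotheses.
        votes = sum(b * h.predict(Xt) for h, b in zip(hypotheses, betas))
        return (votes >= 0.5 * sum(betas)).astype(int)
    return predict
```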
4 EXPERIMENTS
We evaluated our temporal-based term selection and
learning techniques by using the Mainichi Japanese
newspaper documents.
4.1 Experimental Setup
We used the Mainichi Japanese newspaper corpus
from 1991 to 2012. The corpus consists of 2,883,623
documents organized into 16 categories. We selected
8 categories, "International", "Economy", "Home", "Culture", "Reading", "Arts", "Sports", and "Local news", each of which has a sufficient number of documents. Table 2 shows the statistics of the dataset.
Table 2: The data used in the experiments.
Cat Docs Cat Docs
International 91,882 Reading 17,418
Economy 96,745 Arts 29,645
Home 47,984 Sports 183,216
Culture 20,428 Local news 282,829
All documents were tagged using the morphological analyzer ChaSen (Matsumoto et al., 2000). We used noun words for independent/dependent term selection. The total number of documents assigned to these categories is 770,147. For each category within each year, we divided the documents into three folds: 10% of the documents are used as labeled training data, 50% as unlabeled training data, and 40% to test our classification method. For each year, we classified the unlabeled data into categories using the labeled data with LDA. We empirically selected the values of two parameters, the number of classes k and the number of documents d. k is searched in steps of 10 from 10 to 200, and d is searched in steps of 100 from 100 to 500. As a result, for each year, we set k and d to 20 and 700, respectively.
We divided the original labeled training data and the labeled data obtained by LDA into five folds for each year. The first three folds are used in TbFS: we calculated χ² statistics using the first fold, and the second fold is used as training data and the third fold as test data to estimate the numbers of independent/dependent terms. The estimation was done using F-score. As a result of the estimation, we used 35,000 independent terms for each of the 8 categories, and 12,000 dependent terms for each of the 8 categories in each year. The last two folds are used to train TbTL. For each category, we used 50 documents as the same-distribution data. When the time difference between training and test data is more than one year, we used the remainder as diff-distribution data. (When the creation time period of the training data is the same as that of the test data, we used only the same-distribution data.)
We used SVM-light (Joachims, 1998) as the basic learner in the experiments. We compared our method, TbTL with TbFS (TbTL/w), with seven baselines: (1) SVM without TbFS (SVM/wo), (2) SVM with TbFS (SVM/w), (3) biased-SVM (Liu et al., 2003) without TbFS (bSVM/wo), (4) biased-SVM with TbFS (bSVM/w), (5) TrAdaBoost without TbFS (TrAdaB/wo), (6) TrAdaBoost with TbFS (TrAdaB/w), and (7) TbTL without TbFS (TbTL/wo). For the methods without TbFS, i.e., (1), (3), (5), and (7), we used all noun words in the documents.

TrAdaBoost refers to the results obtained by the original TrAdaBoost presented by Dai et al. Biased-SVM is known as a state-of-the-art SVM method and is often used for comparison (Elkan and Noto, 2008). As with SVM, for biased-SVM we used the last two folds as training data and classified the test documents directly, i.e., we used closed data. We empirically selected the values of two parameters, "c" (trade-off between training error and margin) and "j" (cost-factor, by which training errors on positive examples outweigh errors on negative examples), that optimized the F-score obtained by classification of the test documents. c is searched in steps of 0.02 from 0.01 to 0.61. Similarly, j is searched in steps of 5 from 1 to 200. As a result, we set c and j to 0.03 and 4, respectively. To make the comparison fair, all eight methods including ours are based on a linear kernel. Throughout the experiments, the number of iterations is set to 30.
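The parameter search over "c" and "j" can be run, for example, with SVM-light's command-line tools; -c and -j are real svm_learn options, while the file names and the f_score helper below are hypothetical placeholders we introduce for illustration.

```python
import itertools, subprocess

def f_score(pred_file, gold_file):
    """F1 from an SVM-light predictions file and the labels in a .dat file."""
    with open(pred_file) as f:
        preds = [float(line.split()[0]) > 0 for line in f]
    with open(gold_file) as f:
        golds = [float(line.split()[0]) > 0
                 for line in f if not line.startswith("#")]
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum(g and not p for p, g in zip(preds, golds))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

c_values = [round(0.01 + 0.02 * i, 2) for i in range(31)]  # 0.01, ..., 0.61
j_values = range(1, 201, 5)                                # 1, 6, ..., 196

best = None
for c, j in itertools.product(c_values, j_values):
    # svm_learn's -c (trade-off) and -j (cost factor) flags exist in SVM-light;
    # train.dat, test.dat, model, and pred are placeholder file names.
    subprocess.run(["svm_learn", "-c", str(c), "-j", str(j),
                    "train.dat", "model"], check=True)
    subprocess.run(["svm_classify", "test.dat", "model", "pred"], check=True)
    f = f_score("pred", "test.dat")
    if best is None or f > best[0]:
        best = (f, c, j)
print("best F-score %.3f at c=%s, j=%s" % best)
```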
4.1.1 Results
Categorization results for the 8 categories (40% of the test documents, i.e., 308,058 documents) are shown in Table 1. Each value in Table 1 shows the macro-averaged F-score across 22 years. "Macro Avg." in Table 1 refers to the macro-averaged F-score across categories. The results obtained by biased-SVM indicate the maximized F-score obtained by varying the parameters "c" and "j". As can be seen clearly from Table 1, the results with "TbTL/w" and "TrAdaB/w" were better than the results obtained by "bSVM/w", except for "Sports" and "Local news" with "TrAdaB/w", although "bSVM/w" in Table 1 was the result obtained by using the closed data. Moreover, the results obtained by SVM with and without TbFS were the worst among the methods. These observations show that once the training data derive from a different time period than the test data, the distributions of terms between training and test documents are not identical.
Table 1: Categorization Results (Mainichi data). TbTL/w is statistically significant compared with the marked methods by t-test (p-value ≤ 0.05).

Cat            SVM/wo  SVM/w  bSVM/wo  bSVM/w  TrAdaB/wo  TrAdaB/w  TbTL/wo  TbTL/w
International  0.543   0.582  0.546    0.682   0.667      0.682     0.675    0.693
Economy        0.564   0.594  0.699    0.702   0.665      0.702     0.672    0.712
Home           0.432   0.502  0.449    0.692   0.660      0.703     0.664    0.720
Culture        0.082   0.102  0.158    0.301   0.459      0.493     0.402    0.482
Reading        0.468   0.489  0.563    0.571   0.662      0.697     0.530    0.682
Arts           0.353   0.372  0.387    0.652   0.656      0.663     0.664    0.693
Sports         0.773   0.782  0.792    0.802   0.657      0.730     0.675    0.810
Local news     0.623   0.644  0.643    0.702   0.660      0.700     0.667    0.710
Macro Avg.     0.480   0.508  0.530    0.638   0.636      0.671     0.619    0.688
Table 3: Sample results of term selection.
Sports International
ind. dep. (2000) ind. dep. (1997)
baseball Sydney president Tupac Amaru
win Toyota premier Lima
game HP army Kinshirou
competition hung-up power residence
championship Paku government Hirose
entry admission talk Huot
tournament game election MRTA
player Mita UN Topac
defeat Miyawaki politics impression
pro ticket military employment
title ready nation earth
finals Seagirls democracy election
league award minister supplement
first game Gaillard North Korea East Europe
Olympic attackers chair bankruptcy
The overall performance with TbFS was better than that without TbFS for all methods. This shows that temporal-based term selection contributes to classification performance. Table 3 shows the topmost 15 independent/dependent terms obtained by TbFS. The categories are "Sports" and "International". As we can see from Table 3, independent terms such as "baseball" and "win" are salient terms of the category "Sports" regardless of the time period. On the other hand, "Miyawaki" is listed among the dependent terms. The term often appeared in documents from 1998 to 2000 because Miyawaki was a snowboard player and he won his first world championship title in Jan. 1998. Similarly, in the category "International", terms such as "UN" and "North Korea" are listed among the independent terms, as they often appeared in documents regardless of the timeline. In contrast, "Tupac Amaru" and "MRTA" are listed among the dependent terms. This is reasonable because in that year, Tupac Amaru Revolutionary Movement (MRTA) rebels were all killed when Peruvian troops stormed the Japanese ambassador's home, where they had held 72 hostages for more than four months. These observations support our basic assumption: there are two types of salient terms, i.e., terms that are salient for a specific period, and terms that are important regardless of the timeline.
Figure 2: Performance with TbFS against temporal distance.

Figure 3: Performance without TbFS against temporal distance.

Figures 2 and 3 illustrate the F-score with/without TbFS against the temporal difference between training and test data. Both the training and test data are documents from 1991 to 2012. For instance, "5" on the x-axis in Figures 2 and 3 indicates that the test documents were created 5 years later than the training documents. We can see from Figures 2 and 3 that the results with TbFS were better than those without TbFS for all of the methods. Moreover, the result ob-
tained by "TbTL/w" in Figure 2 was the best at all of the temporal distances. There are no significant differences among the three methods "bSVM", "TrAdaB", and "TbTL" when the test and training data are from the same time period, in both Figure 2 and Figure 3. The performance of these methods, including "SVM", drops when the time period of the test data is far from that of the training data. However, the performance of "TbTL" was still better than that obtained by the other methods. This demonstrates that an algorithm applying temporal-based feature selection and learning is effective for categorization. Figure 4 shows the averaged F-score of the categories across the full temporal range with TbFS against the number of iterations. Although the curves are not quite smooth, they converge at around 25 iterations.
Figure 4: F-score with TbFS against the # of iterations.

Figure 5: F-score with/without LDA against temporal distance.

Finally, we tested how the use of LDA influences the overall performance. Figure 5 illustrates the F-score of "TbTL/w" with and without LDA against the temporal difference between training and test data. In "TbTL/w" without LDA, we added 50% (393,759) labeled documents to the original 10% (78,751) la-
beled training documents. As we expected, the results obtained by "TbTL/w" without LDA were better than those with LDA at every temporal distance, and the average improvement in F-score across 22 years was 3.5% (0.723 vs. 0.688). This is not surprising because in "TbTL/w" without LDA we used a large number of labeled training documents, 472,510, which are very expensive and time-consuming to obtain. In contrast, in "TbTL/w" with LDA, we used 78,751 labeled documents across the 22 years in all; the average number of documents per year was 3,579 across the eight categories.
5 CONCLUSIONS AND FUTURE WORK

We have developed an approach to text categorization concerned with the impact of the variation in the strength of the term-category relationship over time. The basic idea is to minimize the impact of temporal effects in both feature selection and learning techniques. The results using the Japanese Mainichi Newspaper corpus show that the temporal-based feature selection and learning method works well for categorization, especially when the creation time of the test data differs greatly from that of the training data.
There are a number of interesting directions for future work. We should be able to obtain further gains in accuracy in independent/dependent term selection by smoothing the term distributions, e.g., of organization and person names, through the use of techniques such as Latent Semantic Analysis (LSA) (Deerwester et al., 1990), the Log-Bilinear Document Model (Maas and Ng, 2010), and word2vec (Mikolov et al., 2013). The quantity of labeled training documents affects the overall performance. Dai et al. attempted to use Transductive Support Vector Machines (Dai et al., 2007; Joachims, 1999). However, they reported that the rate of convergence is slow. This issue needs further investigation. We used LDA to classify unlabeled documents into categories. There are a number of other topic models, such as the continuous time dynamic topic model (Wang et al., 2008) and the biterm topic model (Yan et al., 2013). It is worth testing these methods for further improvement.
ACKNOWLEDGEMENTS
The authors would like to thank the referees for their
comments on the earlier version of this paper. This
work was supported by the Grant-in-aid for the Japan
Society for the Promotion of Science (JSPS), No.
25330255.
REFERENCES
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.
Blitzer, J., McDonald, R., and Pereira, F. (2006). Domain
Adaptation with Structural Correspondence Learning.
In Proc. of the Conference on Empirical Methods in
Natural Language Processing, pp. 120-128.
Dai, W., Yang, Q., Xue, G., and Yu, Y. (2007). Boosting for
Transfer Learning. In Proc. of the 24th International
Conference on Machine Learning, pp. 193-200.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer,
T. K., and Harshman, R. (1990). Indexing by Latent
Semantic Analysis. Journal of the American Society for Information
Science, 41(6):391–407.
Dumais, S. and Chen, H. (2000). Hierarchical Classifica-
tion of Web Contents. In Proc. of the 23rd Annual In-
ternational ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 256-263.
Elkan, C. and Noto, K. (2008). Learning Classifiers from
Only Positive and Unlabeled Data. In Proc. of the 14th
ACM SIGKDD International Conference on Knowl-
edge Discovery & Data Mining, pp. 213-220.
Folino, G., Pizzuti, C., and Spezzano, G. (2007). An
Adaptive Distributed Ensemble Approach to Mine
Concept-drifting Data Streams. In Proc. of the 19th
IEEE International Conference on Tools with Artifi-
cial Intelligence, pp. 183-188.
Forman, G. (2003). An Extensive Empirical Study of Fea-
ture Selection Metrics for Text Classification. Journal of
Machine Learning Research, 3:1289–1305.
Freund, Y. and Schapire, R. E. (1997). A Decision-
Theoretic Generalization of On-Line Learning and an
Application to Boosting. Journal of Computer and
System Sciences, 55(1):119–139.
Glorot, X., Bordes, A., and Bengio, Y. (2011). Domain
Adaptation for Large-Scale Sentiment Classification:
A Deep Learning Approach. In Proc. of the 28th In-
ternational Conference on Machine Learning, pp. 97-
110.
Gopal, S. and Yang, Y. (2010). Multilabel Classification
with Meta-level Features. In Proc. of the 33rd Annual
International ACM SIGIR Conference on Research
and Development in Information Retrieval, pp. 315-
322.
Hassan, S., Mihalcea, R., and Banea, C. (2007). Random-
Walk Term Weighting for Improved Text Classifica-
tion. In Proc. of the IEEE International Conference
on Semantic Computing, pp. 242-249.
He, D. and Parker, D. S. (2010). Topic Dynamics: An Alter-
native Model of Bursts in Streams of Topics. In Proc.
of the 16th ACM SIGKDD Conference on Knowledge
discovery and Data Mining, pp. 443-452.
Hofmann, T. (1999). Probabilistic Latent Semantic Index-
ing. In Proc. of the 22nd Annual International ACM
SIGIR Conference on Research and Development in
Information Retrieval, pp. 35-44.
III, H. D. (2007). Frustratingly Easy Domain Adaptation. In
Proc. of the 45th Annual Meeting of the Association for
Computational Linguistics, pp. 256-263.
Joachims, T. (1998). SVM Light Support Vector Machine.
In Dept. of Computer Science Cornell University.
Joachims, T. (1999). Transductive Inference for Text Clas-
sification using Support Vector Machines. In Proc. of
16th International Conference on Machine Learning,
pp. 200-209.
Kerner, Y. H., Mughaz, D., Beck, H., and Yehudai, E.
(2008). Words as Classifiers of Documents accord-
ing to Their Historical Period and the Ethnic Origin of
Their Authors. Cybernetics and Systems, 39(3):213–
228.
Kleinberg, J. (2002). Bursty and Hierarchical Structure
in Streams. In Proc. of the Eighth ACM SIGKDD In-
ternational Conference on Knowledge Discovery and
Data Mining, pp. 91-101.
Klinkenberg, R. and Joachims, T. (2000). Detecting Con-
cept Drift with Support Vector Machines. In Proc. of
the 17th International Conference on Machine Learn-
ing, pp. 487-494.
Lazarescu, M. M., Venkatesh, S., and Bui, H. H. (2004).
Using Multiple Windows to Track Concept Drift. In-
telligent Data Analysis, 8(1):29–59.
Lewis, D. D. and Ringuette, M. (1994). Comparison of Two
Learning Algorithms for Text Categorization. In Proc.
of the Third Annual Symposium on Document Analysis
and Information Retrieval, pp. 81-93.
Li, Y., Yang, M., and Zhang, Z. (2013). Scientific Articles
Recommendation. In Proc. of the ACM International
Conference on Information and Knowledge Manage-
ment CIKM 2013, pp. 1147-1156.
Liu, B., Dai, Y., Li, X., Lee, W. S., and Yu, P. S. (2003).
Building Text Classifiers using Positive and Unla-
beled Examples. In Proc. of the ICDM’03, pp. 179-
188.
Maas, A. L. and Ng, A. Y. (2010). A Probabilistic Model
for Semantic Word Vectors. NIPS, 10.
Matsumoto, Y., Kitauchi, A., Yamashita, T., Hirano, Y.,
Matsuda, Y., Takaoka, K., and Asahara, M. (2000).
Japanese Morphological Analysis System Chasen Ver-
sion 2.2.1. In NAIST Technical Report.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013).
Efficient Estimation of Word Representations in Vec-
tor Space. In Proc. of the International Conference on
Learning Representations Workshop.
Mourao, F., Rocha, L., Araujo, R., Couto, T., Goncalves,
M., and Jr., W. M. (2008). Understanding Temporal
Aspects in Document Classification. In Proc. of the
1st ACM International Conference on Web Search and
Data Mining, pp. 159-169.
Murphy, J. (1999). Technical Analysis of the Financial Mar-
kets. Prentice Hall.
Raina, R., Ng, A. Y., and Koller, D. (2006). Constructing In-
formative Priors using Transfer Learning. In Proc. of
the 23rd International Conference on Machine Learn-
ing, pp. 713-720.
Salles, T., Rocha, L., and Pappa, G. L. (2010). Temporally-
aware Algorithms for Document Classification. In
Proc. of the 33rd Annual International ACM SIGIR
Conference on Research and Development in Infor-
mation Retrieval, pp. 307-314.
Siao, M. and Guo, Y. (2013). Domain Adaptation for Se-
quence Labeling Tasks with a Probabilistic Language
Adaptation Model. In Proc. of the 30th International
Conference on Machine Learning, pp. 293-301.
Song, M., Heo, G. E., and Kim, S. Y. (2014). Analyzing
topic evolution in bioinformatics: Investigation of dy-
namics of the field with conference data in dblp. Sci-
entometrics, 101(1):397–428.
Sparinnapakorn, K. and Kubat, M. (2007). Combining
Subclassifiers in Text Categorization: A DST-based
Solution and a Case Study. In Proc. of the 13th
ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, pp. 210-219.
Wang, C., Blei, D., and Heckerman, D. (2008). Continuous
Time Dynamic Topic Models. In Proc. of the 24th
Conference on Uncertainty in Artificial Intelligence,
pp. 579-586.
Xue, G. R., Dai, W., Yang, Q., and Yu, Y. (2008). Topic-
bridged PLSA for Cross-Domain Text Classification.
In Proc. of the 31st Annual International ACM SIGIR
Conference on Research and Development in Informa-
tion Retrieval, pp. 627-634.
Yan, X., Guo, J., Lan, Y., and Cheng, X. (2013). A Biterm
Topic Model for Short Texts. In Proc. of the 22nd In-
ternational Conference on World Wide Web, pp. 1445-
1456.
Yang, Y. and Pedersen, J. O. (1997). A Comparative
Study on Feature Selection in Text Categorization. In
Proc. of the 14th International Conference on Ma-
chine Learning, pp. 412-420.