SEMI-SUPERVISED LEARNING OF DOMAIN-SPECIFIC
LANGUAGE MODELS FROM GENERAL DOMAIN DATA
Shuanhu Bai and Haizhou Li
Institute for Infocomm Research, 1 Fusionopolis Way, #21-01 Connexis, Singapore 138632
{sbai, hli}@i2r.a-star.edu.sg
Keywords: Machine learning, Natural language processing, Language model, Topic model.
Abstract: We present a semi-supervised learning method for building domain-specific language models (LMs) from general-domain data. The method aims to use a small amount of domain-specific data as seeds to tap domain-specific resources residing in a larger amount of general-domain data with the help of topic modelling technologies. The proposed algorithm first performs topic decomposition (TD) on the combined dataset of domain-specific and general-domain data using probabilistic latent semantic analysis (PLSA). Then it derives weighted domain-specific word n-gram counts with the mixture modelling scheme of PLSA. Finally, it uses the traditional n-gram modelling approach to construct domain-specific LMs from the domain-specific word n-gram counts. Experimental results show that this approach outperforms both state-of-the-art methods and a simulated supervised learning method on our data sets. In particular, the semi-supervised learning method can achieve better performance even with a very small amount of domain-specific data.
1 INTRODUCTION
LMs are widely used in various natural language processing (NLP) applications such as text mining, machine translation and speech recognition systems. As we know, the performance of LMs is highly domain-dependent. The traditional approach alleviates this problem by adapting a background model with data collected from similar application domains, which is called domain adaptation. In order to achieve better performance, domain-specific application systems usually require LMs to be built entirely from domain-specific data for a very particular task. Existing learning algorithms for building these LMs rely heavily on the availability of high-quality domain-specific data. Collecting enough such data is in most cases not an easy task, and it is much more difficult when we are presented with a new domain. On the other hand, we may have a large amount of general-domain data on hand. Tapping into such cheaper resources to alleviate the shortage of domain-specific data is therefore an attractive choice.
Efforts for building domain-specific LMs have mostly been spent on the issue of obtaining training texts from sources such as the Web (Sarikaya et al., 2005; Sethy et al., 2006; Wan and Hain, 2006). Although much of the unnecessary data can be filtered out by search engines, the data collected from the Web is still far from being directly usable. Some semi-supervised methods are employed to identify useful sentences using selection criteria such as BiLingual Evaluation Understudy (BLEU) (Sarikaya et al., 2005) and relative entropy (Sethy et al., 2006). Text data selection schemes can be regarded as text/sentence classification methods: texts or sentences falling into the domain class are used for LM training. In a similar effort, Liu and Croft (2004) create cluster-based LMs for cluster-based retrieval using a clustered corpus.
Recently, topic modelling methods have been introduced into language modelling for unsupervised topic adaptation (Gildea and Hofmann, 1999; Tam and Schultz, 2005; Heidel et al., 2007; Liu, 2007). Although these methods cannot be directly used for generating high-performance domain-specific LMs, their basic idea of using latent topics as a means to tap into domain-specific knowledge is instructive for our semi-supervised learning. On the other hand, semi-supervised learning techniques have been successfully used in other NLP tasks such as text classification using a small number of labelled texts and a larger pool of unlabelled texts (Nigam et al., 2000; Druck et al., 2007; Xue et al., 2008).
In this paper, we focus on the problem of learning domain-specific LMs from limited domain-specific data in a semi-supervised manner. Given a small domain-specific dataset $D_I$, we are going to build a domain-specific LM by tapping into an available larger general-domain dataset $D_G$ for "useful" information. Our goal is to utilize this information in $D_G$ to compensate for the insufficiency of the domain data $D_I$ and so build higher performance LMs. The key idea of our approach is to use the latent topics of a topic model as a means of learning domain-specific language use knowledge from general-domain data. We can make use of the TD mechanism of a topic model to derive the document-dependent topic distribution of the documents in the training sets. Their topic distributions may appear similar to some extent if documents from $D_I$ and $D_G$ share some similar topics. Thus the topic distributions of the documents actually provide a bridge between datasets $D_I$ and $D_G$. This allows us to derive language uses, such as n-grams in $D_G$, that are highly associated with the topics that $D_I$ prefers. Domain-specific n-gram data can further be obtained from the topic-specific data and the topic-mixture modelling scheme of PLSA. The major advantage of this learning approach is that it is able to yield high-performance LMs with a very small amount of domain-specific data.
The rest of the paper is organized as follows: Section 2 is dedicated to a detailed introduction of the modelling methods as well as the learning algorithm. We show the experimental results in Section 3, and conclude the paper in Section 4.
2 LEARNING METHOD
2.1 Learning Strategies
In this section we introduce our semi-supervised learning method, built on PLSA topic modelling technologies (Hofmann, 1999, 2001). Under the framework of PLSA, the distribution of word $w$ in document $d$ can be described as a mixture of latent topics $t$:

$$p(w|d) = \sum_t p(t|d)\, p(w|t) \qquad (1)$$

where $p(t|d)$ represents the topic distribution of the document, while $p(w|t)$ represents the mixture components in the form of word unigram models. Both $p(t|d)$ and $p(w|t)$ can be obtained by applying the expectation maximization (EM) algorithm to the likelihood of a document collection. The training process is referred to as topic decomposition (TD).
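For concreteness, the following is a minimal sketch of plain PLSA topic decomposition with EM on a dense document-word count matrix. It is not the authors' implementation; the function and variable names are our own, and a real system would use sparse matrices and a convergence check.

```python
import numpy as np

def plsa_em(counts, num_topics, iters=50, seed=0):
    """Minimal PLSA topic decomposition (TD) via EM.

    counts: (D, W) array of document-word counts n(d, w).
    Returns p_t_d (D, T) = p(t|d) and p_w_t (T, W) = p(w|t).
    """
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    p_t_d = rng.random((D, num_topics)); p_t_d /= p_t_d.sum(1, keepdims=True)
    p_w_t = rng.random((num_topics, W)); p_w_t /= p_w_t.sum(1, keepdims=True)

    for _ in range(iters):
        # E-step: posterior p(t|d,w) proportional to p(t|d) p(w|t), shape (D, T, W)
        post = p_t_d[:, :, None] * p_w_t[None, :, :]
        post /= post.sum(1, keepdims=True) + 1e-12
        # expected topic-word counts n(d,w) p(t|d,w)
        exp_counts = counts[:, None, :] * post
        # M-step: re-estimate p(w|t) and p(t|d) from expected counts
        p_w_t = exp_counts.sum(0)
        p_w_t /= p_w_t.sum(1, keepdims=True) + 1e-12
        p_t_d = exp_counts.sum(2)
        p_t_d /= p_t_d.sum(1, keepdims=True) + 1e-12
    return p_t_d, p_w_t
```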
Suppose we have a combined dataset $D = D_G \cup D_I$. We treat $D_I$ not only as a set of documents, but as a domain as well. After applying TD to the data $D$, we can approximate the latent topic distribution of $D_I$ by treating $D_I$ as a single document, which can be expressed as:

$$p(t|D_I) = \frac{\sum_{d_I, w} n(d_I, w)\, p(t|d_I, w)}{\sum_{d_I, w, t'} n(d_I, w)\, p(t'|d_I, w)} \qquad (2)$$

where $d_I$ represents the elements of $D_I$ and $n(d_I, w)$ is the count of word $w$ in document $d_I$. In PLSA, $p(t|d_I, w)$ is interpreted as the probability that topic $t$ is used by document $d_I$ to generate word $w$; thus the term $n(d_I, w)\, p(t|d_I, w)$ is the number of times topic $t$ is used by $d_I$ for generating $w$. We can use the E step of the TD method (Hofmann, 2001) to evaluate $p(t|d_I, w)$.
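A small sketch of Eq. (2), assuming the arrays produced by the PLSA snippet above (again our own illustration, not the paper's code):

```python
import numpy as np

def domain_topic_distribution(counts_I, p_t_d_I, p_w_t):
    """Approximate p(t|D_I) as in Eq. (2), treating D_I as one long document.

    counts_I : (D_I, W) word counts n(d_I, w) for the domain documents.
    p_t_d_I  : (D_I, T) rows of p(t|d) for the domain documents from TD.
    p_w_t    : (T, W) mixture components p(w|t) from TD.
    """
    # E-step posterior p(t|d_I, w) proportional to p(t|d_I) p(w|t)
    post = p_t_d_I[:, :, None] * p_w_t[None, :, :]
    post /= post.sum(1, keepdims=True) + 1e-12
    # numerator of Eq. (2): sum over documents and words of n(d_I, w) p(t|d_I, w)
    topic_mass = (counts_I[:, None, :] * post).sum(axis=(0, 2))
    # the denominator of Eq. (2) simply normalizes over topics
    return topic_mass / topic_mass.sum()
```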
The topic distribution $p(t|D_I)$ can be regarded as the latent topic preference of the domain represented by $D_I$. Since the topic-specific LMs will be working in the same domain as $D_I$ represents, we assume that the topic distribution of incoming documents can simply be modelled by $p(t|D_I)$ during decoding. Therefore, the domain-specific LM can be expressed as:

$$p_I(w|h) = \sum_t p(t|D_I)\, p(w|h, t) \qquad (3)$$

where $p(w|h, t)$ are topic-specific n-gram models. It would be more faithful to the PLSA assumption to take $p(w|h, t)$ to be word unigram models, because that is the assumption of PLSA. Latent topics here only serve as intermediate variables in building the domain-specific models and are summed out afterwards. Now the problem becomes the issue of TD and the derivation of the high-order n-gram models $p(w|h, t)$.
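As a worked illustration of Eq. (3), the mixture could be evaluated at decoding time roughly as follows; `topic_lms` is a hypothetical interface for the topic-specific models, not something defined in the paper:

```python
def domain_lm_prob(w, h, p_t_DI, topic_lms):
    """Eq. (3): p_I(w|h) = sum_t p(t|D_I) * p(w|h, t).

    topic_lms is an assumed list of per-topic conditional models,
    where topic_lms[t](w, h) returns p(w|h, t).
    """
    return sum(p_t * lm(w, h) for p_t, lm in zip(p_t_DI, topic_lms))
```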
2.2 Weighted Topic Decomposition
We know that the parameters of PLSA topic models are estimated from the entire dataset through EM. As assumed at the very beginning, we may have only a few domain-specific texts available while there are plenty of general-domain texts on hand. If we feed the combined dataset into the training process indiscriminately, parameter estimation will be dominated by the general-domain data and the effect of the domain-specific data may be ignored. A solution to such problems is to use a multi-conditional learning scheme (Druck et al., 2007) with a weighted objective function, which can be specified as:

$$O(\Theta) = P(D_I; \Theta)\, P(D_G; \Theta)^{\lambda} \qquad (4)$$

where $P(D_I; \Theta)$ and $P(D_G; \Theta)$ represent the likelihoods of the domain-specific data $D_I$ and the general-domain data $D_G$. The new parameter $\lambda$ introduced into the likelihood function can decrease the contribution of the general-domain data to parameter estimation when we choose $0 \le \lambda < 1$. In practice, it is convenient to maximize the log-likelihood of $O$:

$$\log O(\Theta) = \log P(D_I; \Theta) + \lambda \log P(D_G; \Theta) \qquad (5)$$
It is obvious that the two likelihoods of $D_I$ and $D_G$ share the same set of parameters. The learning objective is to choose the model parameters $\hat{\Theta}$ that maximize the log-likelihood. When we apply this learning strategy to the PLSA framework specified by Eq. (1), the log-likelihood of the general-domain data in Eq. (5) can be expanded as:

$$\lambda \log P(D_G; \Theta) = \lambda \sum_{d_G} \sum_{w} n(d_G, w) \log p(w|d_G) \qquad (6)$$

where $d_G \in D_G$, $n(d_G, w)$ is the count of word $w$ in document $d_G$, and $p(w|d_G)$ can further be expanded by Eq. (1). Most importantly, we notice that the parameter $\lambda$ is always coupled to the counts $n(d_G, w)$ in the log-likelihood function; therefore it can be regarded as a weighting factor for the document-word counts of the general-domain data. We can use a revised EM algorithm for PLSA model training, which is very similar to the EM-$\lambda$ of Nigam et al. (2000).
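A minimal sketch of this weighted TD, assuming the `plsa_em()` function from the earlier snippet; since $\lambda$ multiplies the counts $n(d_G, w)$ throughout Eq. (6), an equivalent recipe is simply to scale the general-domain document-word counts before running the ordinary EM:

```python
import numpy as np

def weighted_topic_decomposition(counts_G, counts_I, lam, num_topics, iters=50):
    """Weighted TD under the objective of Eq. (5) (a sketch, not the paper's code).

    Scales the general-domain document-word counts by lambda and runs the
    ordinary PLSA EM on the combined table, in the spirit of the EM-lambda
    of Nigam et al. (2000).
    """
    weighted = np.vstack([lam * counts_G, counts_I])  # D_G rows first, then D_I
    return plsa_em(weighted, num_topics, iters=iters)
```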
2.3 Weighted N-gram Counts
The PLSA topic model is known as a mixture model whose mixture components are word unigram models. Our objective is to derive high-order mixture components for better performance. A direct solution for building high-order mixture components is to derive topic-specific n-gram counts first; the topic-specific n-gram models can then be constructed from these counts with conventional n-gram modelling methods. For the convenience of later discussion, we use $hw$ to represent a word n-gram sequence. Here $h$ stands for a word history of length $n-1$ (it becomes the empty string for word unigrams), and $w$ is an arbitrary word. Given a document set $D$ and $d \in D$, if we take the view that $p(t|d)$ is the result of a soft classification of the documents in $D$, then the count $c(hw, t)$ of n-gram $hw$ in the training corpus with respect to latent topic $t$ becomes $\sum_d c(d, hw)\, p(t|d)$. As the $p(w|t)$ obtained from TD are considered to be better optimized, we use the PLSA topic modelling assumptions to derive the topic-specific word unigram counts $c(w, t)$ as $\sum_d c(d, w)\, p(t|d, w)$, where $p(t|d, w)$ is the probability that topic $t$ is used by document $d$ for generating word $w$. Taking the weighting factor $\lambda$ for datasets $D_G$ and $D_I$ into consideration, $c(hw, t)$ can be expressed as:

$$c(hw, t) = \begin{cases} \sum_d \delta(d)\, c(d, hw)\, p(t|d) & n > 1 \\ \sum_d \delta(d)\, c(d, w)\, p(t|d, w) & n = 1 \end{cases} \qquad (7)$$

where $c(d, hw)$ is the original count of n-gram $hw$ in document $d$, and $\delta(d)$ is the weighting factor $\lambda$ whenever $d$ is in $D_G$, and 1 whenever $d$ is in $D_I$.
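The following is a sketch of the $n > 1$ branch of Eq. (7); the unigram branch, tokenisation and document bookkeeping are simplified, and the names are our own:

```python
from collections import defaultdict

def topic_ngram_counts(docs, p_t_d, delta, num_topics, order=3):
    """Weighted topic-specific n-gram counts, Eq. (7), for the n > 1 case.

    docs  : list of token lists (general-domain documents followed by D_I documents).
    p_t_d : p_t_d[d][t] = p(t|d) from the weighted TD.
    delta : delta[d] = lambda for general-domain documents, 1.0 for D_I documents.
    Returns counts[(h, w)][t] = sum_d delta(d) * c(d, hw) * p(t|d).
    """
    counts = defaultdict(lambda: [0.0] * num_topics)
    for d, tokens in enumerate(docs):
        for i in range(order - 1, len(tokens)):
            h, w = tuple(tokens[i - order + 1:i]), tokens[i]
            for t in range(num_topics):
                # each occurrence of hw in d contributes delta(d) * p(t|d)
                counts[(h, w)][t] += delta[d] * p_t_d[d][t]
    return counts
```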
After the topic-specific n-gram counts have been derived, we can estimate the topic-specific n-gram model parameters with the maximum likelihood approach, which can be expressed as:

$$p(w|h, t) = c(hw, t) / c(h, t) \qquad (8)$$

where $c(h, t)$ represents the count of the word sequence $h$ followed by any word. Using Eq. (3) and Eq. (8), we can further specify the domain-specific model as follows:

$$p_I(w|h) = \sum_t p(t|D_I)\, \frac{c(hw, t)}{c(h, t)} = \frac{1}{c(h, k)} \sum_t p(t|D_I)\, \frac{c(h, k)}{c(h, t)}\, c(hw, t) = \frac{1}{c(h, k)} \sum_t \alpha(h, t)\, c(hw, t) \qquad (9)$$

where $k$ can theoretically be any of the topics $t$. For the convenience of later smoothing, we let $c(h, k)$ be the mean of $c(h, t)$ over topics to obtain the domain-specific n-gram counts $\sum_t \alpha(h, t)\, c(hw, t)$, which can be regarded as a mixture of the topic-specific n-gram counts $c(hw, t)$ with mixture weights:

$$\alpha(h, t) = p(t|D_I)\, c(h, k) / c(h, t) \qquad (10)$$
Therefore, the modelling effort is changed from a mixture of probabilities to a mixture of counts. That is, instead of estimating the probability parameters for each of the component models, we can conduct count merging first, saving the smoothing effort for each individual topic-specific model. This process can also be regarded as an n-gram weighting scheme (Hsu, 2008) that uses the topic distribution of documents and the topic distribution of the domain. Afterwards, the domain-specific models can be built from the final counts by applying smoothing methods such as cut-off and back-off technologies.
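A sketch of this count merging per Eqs. (9)-(10), reusing the output of the previous snippet (names and interfaces are our own assumptions):

```python
def merge_domain_counts(topic_counts, p_t_DI, num_topics):
    """Domain-specific n-gram counts via count merging, Eqs. (9)-(10).

    topic_counts : dict (h, w) -> list of c(hw, t), e.g. from topic_ngram_counts().
    p_t_DI       : p(t|D_I) from Eq. (2).
    alpha(h, t) = p(t|D_I) * c(h, k) / c(h, t), where c(h, k) is taken as the
    mean of c(h, t) over topics, as suggested in the text.
    """
    # history marginals c(h, t) = sum_w c(hw, t)
    hist = {}
    for (h, w), cts in topic_counts.items():
        acc = hist.setdefault(h, [0.0] * num_topics)
        for t in range(num_topics):
            acc[t] += cts[t]
    merged = {}
    for (h, w), cts in topic_counts.items():
        c_h_t = hist[h]
        c_h_k = sum(c_h_t) / num_topics  # mean of c(h, t) over topics
        merged[(h, w)] = sum(
            p_t_DI[t] * c_h_k / c_h_t[t] * cts[t]
            for t in range(num_topics)
            if c_h_t[t] > 0
        )
    return merged
```

The merged counts can then be passed to a standard n-gram toolkit that applies count cut-offs and back-off smoothing to produce the final domain-specific LM.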
3 EXPERIMENTS
3.1 Datasets
Our experiments are carried out with part of the LDC corpus NA_News98 and some data from 20Newsgroups (http://people.csail.mit.edu/jrennie/20Newsgroups/, version 20news_bydate).
Table 1: Data sets used in the experiments.

Data                 | Source                                    | # of Docs
$D_G$                | NA_NEWS98, part of NYT/1997               | Total: 106,431 (cat='s': 13,239)
Easy $D_I$           | NA_NEWS98, part of NYT/1998 with cat='s'  | 100 / 500 / 900 / 1300 / 1700
Hard $D_I$           | 20NEWSGROUP, category 'sci.med'           | 400
Easy $D_I$ test set  | NA_NEWS98, part of NYT/1998 with cat='s'  | 500
Hard $D_I$ test set  | 20NEWSGROUP, category 'sci.med'           | 100
Table 1 shows the structure of the datasets used in our experiments. Because the texts in NA_News98 are well categorized into different domains, they enable us to conduct simulation experiments using comparatively larger-scale datasets. Dataset $D_G$ is built by randomly choosing texts from the subset NYT/1997; it consists of 106,431 documents in total, among which 13,239 documents are in category 's' (sports). We compile two groups of $D_I$. One is referred to as the "easy domain", where the documents are selected from the subset NYT/1998 with the same category 's'. We call it the easy domain because $D_G$ does contain documents of the same domain as $D_I$. In order to study the relationship between the amount of domain-specific data and the performance of our learning algorithm, the $D_I$ sets in this group are created in different sizes. We also compiled a $D_I$ for a "hard domain" from 20Newsgroups, category 'sci.med', standing for the medical domain. We call it the hard domain because $D_G$ hardly contains documents of the same domain as $D_I$. Given datasets $D_G$ and $D_I$, the document-word tables for TD are built by applying a stop-word list of around 500 entries, and words with original counts of less than 3 are not used. When the semi-supervised learning method is used, the vocabulary of the most frequent 60K words for each model is selected from the weighted counts. We use word tri-gram models in our experiments unless otherwise mentioned.
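As a rough sketch of this preprocessing (with our own function names, and leaving out the separate selection of the 60K LM vocabulary from the weighted counts), the document-word table could be built as follows:

```python
from collections import Counter

def build_doc_word_table(docs, stopwords, min_count=3):
    """Build the document-word table for TD as described above (a rough sketch).

    Removes stop-words and drops words whose total count is below min_count.
    Returns per-document Counter objects over the retained vocabulary.
    """
    totals = Counter(w for doc in docs for w in doc if w not in stopwords)
    vocab = {w for w, c in totals.items() if c >= min_count}
    return [Counter(w for w in doc if w in vocab) for doc in docs], sorted(vocab)
```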
3.2 Experiments on Easy Domain
To investigate whether the learning algorithm can effectively tap into the domain knowledge residing in general-domain data, we create domain-specific document sets $D_I$ in the domain of sports in different sizes, as indicated by Easy $D_I$ in Table 1. For comparison purposes, we build different models from both $D_G$ and the $D_I$ sets using different approaches. Figure 1 shows the perplexity test results. The supervised learning model is built from a $D_I$ as well as the documents of category 's' in $D_G$; it is simply used to simulate a manual data collection process. The domain adaptation model is the result of linear interpolation of the background model built from $D_G$ and the domain-specific model built from a $D_I$. The relative entropy model is created with the relative entropy text selection scheme (Sethy, 2007), which extracts relevant documents from $D_G$ with a bootstrap model built from a $D_I$. The semi-supervised learning model is built with our approach as described in Algorithm 1, with a setting of 8 latent topics and optimized $\lambda$.
From Figure 1 we can see that our semi-supervised learning method easily outperforms the other three approaches in terms of test set perplexity reduction. The models yielded by the domain adaptation approach perform the worst, which indicates that domain adaptation is not a good way to create domain-specific models. The performance of the models built with the relative entropy text selection criterion lies between that of supervised learning and domain adaptation. We found that it is still very important for this method to have a bootstrap model built from sizeable, balanced data.
Figure 1: Perplexity test results of different learning approaches with different sizes of $D_I$ (y-axis: perplexity, 220-290; x-axis: number of domain-specific documents, 100-1,700; curves: supervised learning, domain adaptation, semi-supervised learning, relative entropy).
We also notice that our semi-supervised learning method outperforms the supervised learning method in the present configuration. The reason could be that the documents in the training set are probably not perfectly categorized: there exist texts that should have been classified into the sports category, and our learning algorithm manages to find and make use of them. The simulated supervised learning method, on the other hand, is not able to retrieve such texts with its simple string matching mechanism, so the collected training data is not sufficient. Another reason may be that the category boundaries of texts from news media are not very clear. Our algorithm may somehow borrow information, such as the topic-specific word co-occurrence preferences that an n-gram model aims to capture, from texts of other domains.
In addition, our semi-supervised learning algorithm works well with a smaller domain-specific dataset. In particular, with the $D_I$ of only 100 texts, which is only about 0.1% of the size of $D_G$, we can yield models with higher performance than those created by the other three methods. Its perplexity of 248.3 at this point is the lowest, compared with 271.4 for the second best performing method, supervised learning, at the same point, and 251 at 1,700 documents.
3.3 Experiments on Hard Domain
It is even more important if the method is able to learn knowledge from general-domain data for hard domains. In order to carry out such an experiment, we create a domain-specific dataset, the hard $D_I$, as well as a hard test set, as shown in Table 1. The experimental results are presented in Table 3.
Table 3: Experimental results on hard domain.

Modelling method                      | Perplexity | OOV rate
Baseline built from $D_G$             | 533        | 5.4%
Baseline ($D_G$) + Domain ($D_I$)     | 359        | 3.3%
SSL with $T=12$, $\lambda=1.0$        | 317        | 3.6%
SSL with $T=12$, $\lambda=0.8$        | 281        | 2.8%
SSL with $T=12$, $\lambda=0.6$        | 263        | 2.1%
SSL with $T=12$, $\lambda=0.4$        | 292        | 3.4%
We observe from Table 3 that there exists a significant domain mismatch between $D_G$ and the test set. Without the help of $D_I$, the baseline model built from $D_G$ alone generates an astonishing perplexity of 533 and an out-of-vocabulary (OOV) rate of 5.4% against the test set. Domain adaptation with the linear interpolation method achieves a significant perplexity reduction even with a very small amount of data, but our semi-supervised learning (SSL) method makes further improvement. Contrary to the phenomenon in the previous experiment, our method reaches its optimal state only when $\lambda$ is set to the smaller value of 0.6. This can be explained by the fact that $D_G$ does not contain much clearly domain-specific data; we need to decrease its influence over $D_I$ in order to find useful information during TD and n-gram model training. We also notice that both the perplexity and the OOV rate improve consistently when we decrease $\lambda$ from 1 to 0.6. This means that the learning algorithm is able to extract more helpful information from the training corpus through appropriate re-weighting of the datasets.
3.4 Experiments on Parameter Setting
As mentioned earlier, there are two free parameters that need to be set for our learning algorithm: the number of latent topics $T$ and the weighting factor $\lambda$ for the general-domain data. In order to investigate how these parameters affect the learning performance, we conduct experiments with different parameter settings.
Figure 2 shows the performance of the models obtained with different weighting factors $\lambda$ and 8 latent topics. The experiments are carried out with $D_I$ in different sizes. We observe that our learning algorithm can hardly achieve its best performance with the setting of $\lambda = 1$; the weighting factor is therefore important for deriving higher-performance models. In particular, the model obtained with the $D_I$ of size 100 becomes slightly worse, with a perplexity of 249 when $\lambda$ is set to 0.8 compared with 248.3 when $\lambda$ is set to 1. On the other hand, the overall trend becomes worse when $\lambda$ is set to 0.4 or below. This can be explained by the fact that excessively lowering the weight of $D_G$ amounts to excessively raising the weight of $D_I$.
Figure 2: Perplexity test results of models obtained with different sizes of $D_I$ (100, 500, 900, 1300) and different $\lambda$ (y-axis: perplexity, 220-290; x-axis: weight for general purpose data, 1 down to 0.2).
We also study the effect of the number of latent topics on the performance of our learning algorithm. Table 4 summarizes the experimental results with different numbers of topics.
Table 4: Perplexities for different numbers of latent topics.

# latent topics $T$               | 4   | 8   | 12  | 16  | 20
$|D_I|$ = 500, $\lambda$ = 1      | 267 | 239 | 229 | 224 | 221
$|D_I|$ = 1300, $\lambda$ = 0.8   | 261 | 227 | 222 | 219 | 218
From Table 4 we notice that a larger number of topics results in better performance, which is in line with the results of prior art. However, the performance does not improve much once $T$ reaches 16. This trend is different from the results of Liu (2007) and Tam (2005), where much larger numbers of topics (from 50 to 200) are applied with word unigram models. The experiments also preliminarily reveal that the setting of $\lambda$ has no direct impact on the setting of $T$.
4 CONCLUSIONS
In this paper we proposed a novel semi-supervised learning method for building domain-specific LMs. The innovative aspects of our method are: the learning strategy and the derivation of the topic distribution of the domain of interest; the weighted TD method for the combined dataset of domain-specific and general-domain data; and the n-gram weighting strategy for domain-specific LMs. The whole learning process follows the multi-conditional learning scheme, which can effectively balance the influence of the domain-specific and general-domain data. We conducted experiments on an easy domain as well as a hard domain, and the results show that the proposed method is very effective. It not only achieves better performance than state-of-the-art methods, but also delivers better results than the simulated supervised learning process does in the present configuration.
As future work, we may extend the learning strategy to other domains. We will also consider using other topic modelling methods to make the learning method more effective.
REFERENCES
Druck, G., Pal, C., Zhu, X., McCallum, A., “Semi-
Supervised Classification with Hybrid Generative/
Discriminative Method”. KDD’07. August 12-25, CA
USA, 2007.
Gildea, D. and Hofmann, T., “Topic-based language models using EM”, Proc. of Eurospeech, 1999.
Heidel, A., Chang, H.A. and Lee, L.S., “Language Model Adaptation Using Latent Dirichlet Allocation and Efficient Topic Inference Algorithm”, INTERSPEECH’2007, 2007.
Hofmann, T., “Unsupervised Learning by Probabilistic Latent Semantic Analysis”, Machine Learning, 42, 177-196, 2001.
Hsu, B. J., and Glass, J., “N-gram Weighting: Reducing
Training Data Mismatch in Cross-Domain Language
Model Estimation”, p829-838, Proc. EMNLP’08, 2008.
Liu, F. and Liu, Y., “Unsupervised Language Model
Adaptation Incorporating Named Entity Information”,
ACL’2007, Prague, Czech Republic. 2007.
Liu, X., and Croft, W.B., “Cluster-Based Retrieval Using
Language Model” SIGIR’04, July 25-29, UK, 2004.
Nigam, K., McCallum, A.K., Thrun, S., and Mitchell, T.M., “Text classification from labeled and unlabeled documents using EM”, Machine Learning, 39, 103-134, 2000.
Sarikaya, R., Gravano, A. and Gao, Y., “Rapid language
model development using external resources for new
spoken dialogue domain”, ICASSP2005, 2005.
Sethy, A., Georgiou, P.G., and Narayanan, S., “Text data
acquisition for domain-specific language models”
p382-389, EMNLP 2006.
Tam, Y. and Schultz, T., “Dynamic Language Model
Adaptation using Variational Bayes Inference”,
INTERSPEECH’05, 2005.
Wan, V., Hain, T., “Strategies for Language Model Web-data Collection”, ICASSP’2006, 2006.
Xue, G.R., Dai, W.Y., Yang, Q. and Yi, Y., “Topic-bridged PLSA for cross-domain text classification”, SIGIR’08, July 20-24, 2008, Singapore.