We also study the effect of the number of latent
topics on the performance of our learning algorithm.
Table 4 summarizes the experimental results for
different numbers of topics.
Table 4: Perplexities for different numbers of latent topics.

# latent topics T          4     8     12    16    20
|D_I| = 500,  λ = 1        267   239   229   224   221
|D_I| = 1300, λ = 0.8      261   227   222   219   218
From Table 4 we notice that a larger number of
topics results in better performance, which is in
line with prior work, but the performance does not
improve much once T reaches 16. This trend differs
from the results of Liu and Liu (2007) and Tam and
Schultz (2005), where much larger numbers of topics
(from 50 to 200) are applied
with word unigram models. The experiments also
preliminarily reveal that the settings of |D_I| and λ
have no direct impact on the appropriate setting of T.
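The sweep over T in Table 4 can be reproduced in outline as follows. This is only a minimal sketch: it uses scikit-learn's LatentDirichletAllocation and its held-out perplexity as a generic stand-in for the topic-model component, whereas Table 4 reports the perplexity of the full adapted n-gram LM (the weighted TD and n-gram weighting steps are not reproduced here); the names train_docs and heldout_docs are hypothetical.

# Minimal sketch: sweep the number of latent topics T (as in Table 4) and
# record a held-out perplexity for each setting. scikit-learn's LDA is a
# hypothetical stand-in for the paper's topic model.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def sweep_topics(train_docs, heldout_docs, topic_grid=(4, 8, 12, 16, 20)):
    """Fit one topic model per T and return {T: held-out perplexity}."""
    vectorizer = CountVectorizer(min_df=2, stop_words="english")
    x_train = vectorizer.fit_transform(train_docs)
    x_heldout = vectorizer.transform(heldout_docs)
    results = {}
    for num_topics in topic_grid:
        lda = LatentDirichletAllocation(n_components=num_topics,
                                        max_iter=20, random_state=0)
        lda.fit(x_train)
        # Lower held-out perplexity indicates a better fit, mirroring Table 4.
        results[num_topics] = lda.perplexity(x_heldout)
    return results

# Hypothetical usage, assuming train_docs / heldout_docs are lists of strings:
# for t, ppl in sweep_topics(train_docs, heldout_docs).items():
#     print("T = %2d   perplexity = %.1f" % (t, ppl))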
4 CONCLUSIONS
In this paper we proposed a novel semi-supervised
learning method for building domain-specific LMs.
The innovative aspects of our method are: the
learning strategy and the derivation of the topic
distribution of the domain of interest; the weighted
TD method for the combined dataset of domain-specific
and general-domain data; and the n-gram weighting
strategy for domain-specific LMs. The whole
learning process follows a multi-conditional
learning scheme that effectively balances the
influence of the domain-specific and general-domain
data. We conducted experiments on an easy domain as
well as a hard domain, and the results show that the
proposed method is very effective. It not only
achieves better performance than the state-of-the-art
method, but also delivers better results than the
simulated supervised learning process under the
present configuration.
As future work, we plan to extend the learning
strategy to other domains. We will also consider
using other topic modelling methods to make the
learning method more effective.
REFERENCES
Druck, G., Pal, C., Zhu, X. and McCallum, A., "Semi-
Supervised Classification with Hybrid Generative/
Discriminative Methods", KDD'07, August 12-15, San
Jose, CA, USA, 2007.
Gildea, D. and Hofmann, T., "Topic-based language
models using EM", Proc. of Eurospeech, 1999.
Heidel, A., Chang, H.A. and Lee, L.S., "Language Model
Adaptation Using Latent Dirichlet Allocation and an
Efficient Topic Inference Algorithm",
INTERSPEECH'07, 2007.
Hofmann, T., "Unsupervised Learning by Probabilistic
Latent Semantic Analysis", Machine Learning,
42, 177-196, 2001.
Hsu, B.J. and Glass, J., "N-gram Weighting: Reducing
Training Data Mismatch in Cross-Domain Language
Model Estimation", pp. 829-838, Proc. EMNLP'08, 2008.
Liu, F. and Liu, Y., "Unsupervised Language Model
Adaptation Incorporating Named Entity Information",
ACL'07, Prague, Czech Republic, 2007.
Liu, X. and Croft, W.B., "Cluster-Based Retrieval Using
Language Models", SIGIR'04, July 25-29, UK, 2004.
Nigam, K., McCallum, A.K., Thrun, S. and Mitchell,
T.M., "Text classification from labeled and unlabeled
documents using EM", Machine Learning, 39, 103-134,
2000.
Sarikaya, R., Gravano, A. and Gao, Y., "Rapid language
model development using external resources for new
spoken dialog domains", ICASSP'05, 2005.
Sethy, A., Georgiou, P.G. and Narayanan, S., "Text data
acquisition for domain-specific language models",
pp. 382-389, EMNLP'06, 2006.
Tam, Y. and Schultz, T., “Dynamic Language Model
Adaptation using Variational Bayes Inference”,
INTERSPEECH’05, 2005.
Wan, V. and Hain, T., "Strategies for language model
web-data collection", ICASSP'06, 2006.
Xue, G.R., Dai, W.Y., Yang, Q. and Yu, Y., "Topic-
bridged PLSA for cross-domain text classification",
SIGIR'08, July 20-24, 2008, Singapore.