SEMI-SUPERVISED LEARNING OF DOMAIN-SPECIFIC LANGUAGE MODELS FROM GENERAL DOMAIN DATA
Shuanhu Bai, Haizhou Li
2009
Abstract
We present a semi-supervised learning method for building domain-specific language models (LMs) from general-domain data. The method uses a small amount of domain-specific data as seeds to tap the domain-specific resources residing in a larger amount of general-domain data, with the help of topic modelling techniques. The proposed algorithm first performs topic decomposition (TD) on the combined set of domain-specific and general-domain data using probabilistic latent semantic analysis (PLSA). It then derives domain-specific word n-gram counts from the PLSA mixture model. Finally, it applies traditional n-gram modelling to construct domain-specific LMs from these counts. Experimental results show that on our data sets this approach outperforms both state-of-the-art methods and a simulated supervised learning method. In particular, the semi-supervised method achieves better performance even with a very small amount of domain-specific data.
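The pipeline the abstract describes — fit PLSA on the combined corpus, use the topic posteriors to weight each general-domain document by its relevance to the seed domain, then accumulate fractionally weighted n-gram counts — can be sketched as below. This is a minimal illustration on toy data, not the authors' implementation: the corpora, the two-topic setting, and the choice of the "domain topic" as the one dominating the seed documents are all assumptions for the example.

```python
import numpy as np
from collections import Counter

def plsa(X, K=2, iters=100, seed=0):
    """Fit PLSA to a doc-term count matrix X via EM.

    Returns p(w|z) with shape (K, W) and p(z|d) with shape (D, K)."""
    rng = np.random.default_rng(seed)
    D, W = X.shape
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: p(z|d,w) proportional to p(z|d) * p(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # (D, K, W)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        weighted = X[:, None, :] * post                      # n(d,w) * p(z|d,w)
        # M-step: re-estimate both distributions from expected counts
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# Hypothetical toy corpora: a few in-domain seed documents plus general-domain ones.
seed_docs = [["stock", "market", "price"], ["market", "trade", "price"]]
general_docs = [["stock", "price", "rise"], ["dog", "park", "walk"],
                ["market", "trade", "rise"], ["dog", "walk", "run"]]
docs = seed_docs + general_docs

# Build the doc-term count matrix over the combined data.
vocab = sorted({w for d in docs for w in d})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(docs), len(vocab)))
for d, doc in enumerate(docs):
    for w in doc:
        X[d, idx[w]] += 1

p_w_z, p_z_d = plsa(X, K=2)

# Take the topic with the most mass in the seed documents as the domain topic;
# each document's posterior on it serves as a domain-relevance weight in [0, 1].
dom = int(np.argmax(p_z_d[:len(seed_docs)].mean(axis=0)))
weights = p_z_d[:, dom]

# Accumulate domain-weighted (fractional) bigram counts across all documents.
bigrams = Counter()
for wt, doc in zip(weights, docs):
    for a, b in zip(doc, doc[1:]):
        bigrams[(a, b)] += wt
```

The resulting fractional counts in `bigrams` would then feed a standard n-gram estimator with smoothing and backoff, as in the final step of the method.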
Paper Citation
in Harvard Style
Bai, S. and Li, H. (2009). SEMI-SUPERVISED LEARNING OF DOMAIN-SPECIFIC LANGUAGE MODELS FROM GENERAL DOMAIN DATA. In KDIR (IC3K 2009). SciTePress.
in Bibtex Style
@conference{kdir09,
author={Shuanhu Bai and Haizhou Li},
title={SEMI-SUPERVISED LEARNING OF DOMAIN-SPECIFIC LANGUAGE MODELS FROM GENERAL DOMAIN DATA},
booktitle={KDIR (IC3K 2009)},
year={2009},
pages={},
publisher={SciTePress},
organization={INSTICC},
doi={},
isbn={},
}
in EndNote Style
TY - CONF
JO - KDIR (IC3K 2009)
TI - SEMI-SUPERVISED LEARNING OF DOMAIN-SPECIFIC LANGUAGE MODELS FROM GENERAL DOMAIN DATA
SN -
AU - Bai S.
AU - Li H.
PY - 2009
SP -
EP -
DO -