SEMI-SUPERVISED LEARNING OF DOMAIN-SPECIFIC LANGUAGE MODELS FROM GENERAL DOMAIN DATA

Shuanhu Bai, Haizhou Li

Abstract

We present a semi-supervised learning method for building domain-specific language models (LMs) from general-domain data. The method uses a small amount of domain-specific data as seeds to tap the domain-specific resources residing in a larger amount of general-domain data, with the help of topic modelling techniques. The proposed algorithm first performs topic decomposition (TD) on the combined dataset of domain-specific and general-domain data using probabilistic latent semantic analysis (PLSA). It then derives domain-specific word n-gram counts using the mixture modelling scheme of PLSA. Finally, it uses a traditional n-gram modelling approach to construct domain-specific LMs from these counts. Experimental results show that this approach can outperform both state-of-the-art methods and the simulated supervised learning method on our data sets. In particular, the semi-supervised learning method can achieve better performance even with a very small amount of domain-specific data.
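
The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the domain topic is identified or how documents are weighted, so the code assumes that the domain topic is the one with the largest total weight on the seed documents, that seed documents count with weight 1, and that every other document contributes n-gram counts weighted by P(z_domain | d). The function names (plsa, domain_ngram_counts, bigram_prob) and the toy documents are hypothetical, and the final estimate is an unsmoothed maximum-likelihood bigram, whereas a practical n-gram LM would add smoothing.

```python
import random
from collections import Counter, defaultdict


def plsa(docs, n_topics, n_iter=30, seed=0):
    """Fit PLSA by EM on tokenised docs; return (P(z|d), P(w|z))."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    # Random, normalised initialisation of P(w|z) and P(z|d).
    p_w_z = []
    for _ in range(n_topics):
        raw = {w: rng.random() + 0.1 for w in vocab}
        s = sum(raw.values())
        p_w_z.append({w: v / s for w, v in raw.items()})
    p_z_d = []
    for _ in docs:
        raw = [rng.random() + 0.1 for _ in range(n_topics)]
        s = sum(raw)
        p_z_d.append([v / s for v in raw])
    for _ in range(n_iter):
        new_wz = [defaultdict(float) for _ in range(n_topics)]
        new_zd = [[0.0] * n_topics for _ in docs]
        for d, cnt in enumerate(counts):
            for w, n in cnt.items():
                # E-step: posterior P(z | d, w) for this word in this doc.
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(post) or 1.0
                for z in range(n_topics):
                    r = n * post[z] / total
                    new_wz[z][w] += r
                    new_zd[d][z] += r
        # M-step: renormalise the expected counts.
        for z in range(n_topics):
            s = sum(new_wz[z].values()) or 1.0
            p_w_z[z] = {w: new_wz[z][w] / s for w in vocab}
        for d in range(len(docs)):
            s = sum(new_zd[d]) or 1.0
            p_z_d[d] = [v / s for v in new_zd[d]]
    return p_z_d, p_w_z


def domain_ngram_counts(docs, seed_ids, n_topics=2, order=2, n_iter=30):
    """Fractional n-gram counts weighted by each doc's domain-topic share."""
    p_z_d, _ = plsa(docs, n_topics, n_iter)
    # Assumption: the domain topic is the one the seed docs load on most.
    domain_z = max(range(n_topics),
                   key=lambda z: sum(p_z_d[d][z] for d in seed_ids))
    counts = defaultdict(float)
    for d, tokens in enumerate(docs):
        # Assumption: seeds count fully, other docs by P(z_domain | d).
        weight = 1.0 if d in seed_ids else p_z_d[d][domain_z]
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += weight
    return dict(counts)


def bigram_prob(counts, history, word):
    """Unsmoothed maximum-likelihood P(word | history) from bigram counts."""
    num = counts.get((history, word), 0.0)
    den = sum(c for (h, _), c in counts.items() if h == history)
    return num / den if den else 0.0


# Toy corpus: document 0 is the domain-specific seed, the rest are general.
docs = [
    ["patient", "needs", "heart", "surgery"],
    ["doctor", "treats", "heart", "patient"],
    ["stock", "market", "prices", "fall"],
    ["market", "prices", "rise", "today"],
]
counts = domain_ngram_counts(docs, seed_ids={0})
print(bigram_prob(counts, "heart", "surgery"))
```

Fractional counts keep the last step unchanged: any n-gram estimator that accepts real-valued counts can consume the output of domain_ngram_counts in place of raw corpus counts.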



Paper Citation


in Harvard Style

Bai S. and Li H. (2009). SEMI-SUPERVISED LEARNING OF DOMAIN-SPECIFIC LANGUAGE MODELS FROM GENERAL DOMAIN DATA. In KDIR (IC3K 2009).


in Bibtex Style

@conference{kdir09,
author={Shuanhu Bai and Haizhou Li},
title={SEMI-SUPERVISED LEARNING OF DOMAIN-SPECIFIC LANGUAGE MODELS FROM GENERAL DOMAIN DATA},
booktitle={ - KDIR, (IC3K 2009)},
year={2009},
pages={},
publisher={SciTePress},
organization={INSTICC},
doi={},
isbn={},
}


in EndNote Style

TY - CONF
JO - - KDIR, (IC3K 2009)
TI - SEMI-SUPERVISED LEARNING OF DOMAIN-SPECIFIC LANGUAGE MODELS FROM GENERAL DOMAIN DATA
SN -
AU - Bai S.
AU - Li H.
PY - 2009
SP - 0
EP - 0
DO -