Distance Based Active Learning for Domain Adaptation

Christian Pölitz

Abstract

We investigate methods that couple Domain Adaptation with Active Learning to reduce the number of labels needed to train a classifier. We assume a classification task on a given unlabelled set of documents, together with access to labels for documents from other sets, where each of these other sets is drawn from a different distribution. Our approach combines Domain Adaptation with Active Learning to find a minimal number of labelled documents from the other sets that suffices to train a high-quality classifier. The underlying assumption is that documents from different sets that lie close together in a latent topic space are useful for a classification task on the given target set.
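The selection criterion described above can be sketched as follows. This is a minimal illustration, not the paper's actual method: it assumes documents are already represented as topic proportions (e.g., from an LDA model), and it ranks labelled source documents by their Euclidean distance to the centroid of the unlabelled target set. The function name and toy data are hypothetical.

```python
import numpy as np

def select_closest_sources(source_topics, target_topics, k):
    """Return indices of the k labelled source documents whose topic
    vectors are nearest to the centroid of the unlabelled target set."""
    centroid = target_topics.mean(axis=0)
    dists = np.linalg.norm(source_topics - centroid, axis=1)
    return np.argsort(dists)[:k]

# Toy topic proportions (rows: documents, columns: topics).
source = np.array([[0.8, 0.1, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.4, 0.3]])
target = np.array([[0.1, 0.7, 0.2],
                   [0.2, 0.6, 0.2]])

picked = select_closest_sources(source, target, k=2)
print(picked)  # indices of the two closest source documents
```

In a full active-learning loop, the labels of the selected source documents would then be used (possibly reweighted) to train the target-domain classifier, and the distance ranking recomputed as the topic representation is refined.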

References

  1. Maria-Florina Balcan, Andrei Broder, and Tong Zhang. Margin based active learning. In Proceedings of the 20th Annual Conference on Learning Theory, COLT'07, pages 35-50, Berlin, Heidelberg, 2007. Springer-Verlag.
  2. John Blitzer, Ryan McDonald, and Fernando Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP'06, pages 120-128, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics.
  3. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, March 2003.
  4. Yee Seng Chan and Hwee Tou Ng. Domain adaptation with active learning for word sense disambiguation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 49-56, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
  5. Hal Daumé, III and Daniel Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26(1):101-126, May 2006.
  6. Susan T. Dumais. Latent semantic analysis. Annual Review of Information Science and Technology, 38(1):188-230, 2004.
  7. Wenyuan Dai, Gui-Rong Xue, Qiang Yang, and Yong Yu. Co-clustering based classification for out-of-domain documents. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'07, pages 210-219, New York, NY, USA, 2007. ACM.
  8. T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1):5228-5235, April 2004.
  9. Thorsten Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms. Kluwer Academic Publishers, Norwell, MA, USA, 2002.
  10. Jing Jiang and Chengxiang Zhai. Instance weighting for domain adaptation in NLP. In Proceedings of the Association for Computational Linguistics, ACL'07, pages 264-271, 2007.
  11. S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, March 1951.
  12. David D. Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on Machine Learning, ICML'94, pages 148-156. Morgan Kaufmann, 1994.
  13. Chunyong Luo, Yangsheng Ji, Xinyu Dai, and Jiajun Chen. Active learning with transfer learning. In Proceedings of ACL 2012 Student Research Workshop, pages 13-18, Jeju Island, Korea, July 2012. Association for Computational Linguistics.
  14. Yi Lin, Yoonkyung Lee, and Grace Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46(1-3):191-202, March 2002.
  15. Art Owen and Yi Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95(449):135-143, 2000.
  16. John C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61-74. MIT Press, 1999.
  17. Burr Settles. Active Learning Literature Survey. Technical Report 1648, University of Wisconsin-Madison, 2009.
  18. Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985-1005, December 2007.
  19. Avishek Saha, Piyush Rai, Hal Daumé, Suresh Venkatasubramanian, and Scott L. DuVall. Active supervised domain adaptation. In Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part III, ECML PKDD'11, pages 97-112, Berlin, Heidelberg, 2011. Springer-Verlag.
  20. Si Si, Dacheng Tao, and Bo Geng. Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 22(7):929-942, 2010.
  21. Masashi Sugiyama, Makoto Yamada, Paul von Bünau, Taiji Suzuki, Takafumi Kanamori, and Motoaki Kawanabe. Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search. Neural Networks, 24(2):183-198, 2011.
  22. Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25(5):1324-1370, 2013.


Paper Citation


in Harvard Style

Pölitz C. (2015). Distance Based Active Learning for Domain Adaptation. In Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-076-5, pages 296-303. DOI: 10.5220/0005217302960303


in Bibtex Style

@conference{icpram15,
author={Christian Pölitz},
title={Distance Based Active Learning for Domain Adaptation},
booktitle={Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2015},
pages={296-303},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005217302960303},
isbn={978-989-758-076-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Distance Based Active Learning for Domain Adaptation
SN - 978-989-758-076-5
AU - Pölitz C.
PY - 2015
SP - 296
EP - 303
DO - 10.5220/0005217302960303
ER -