Cross-domain Text Classification through Iterative Refining of Target Categories Representations
Giacomo Domeniconi, Gianluca Moro, Roberto Pasolini, Claudio Sartori
2014
Abstract
Cross-domain text classification deals with predicting topic labels for documents in a target domain by leveraging knowledge from pre-labeled documents in a source domain, where the two domains use different terms or different term distributions. Existing methods address this problem either by re-weighting source-domain documents to transfer them to the target domain or by finding a common feature space for documents of both domains; they often combine complex techniques, which introduces a number of parameters that must be tuned for each dataset to obtain optimal performance. We present a simpler method based on creating explicit representations of topic categories, which can be compared for similarity with those of documents. Category representations are initially built from relevant source documents and are then iteratively refined by considering the most similar target documents, with relatedness measured by a simple regression model based on cosine similarity, built once at the beginning. This is expected to yield accurate representations of the categories in the target domain, which are then used to classify its documents. Experiments on common benchmark text collections show that this approach, using fixed empirical values for its few parameters, obtains results better than or comparable to those of other methods.
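The procedure outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: documents are sparse term-weight dictionaries, a category representation is the centroid of its documents, the "simple regression model" is taken to be a univariate logistic regression fitted by plain gradient descent, and a target document is selected for a category when its estimated relatedness exceeds a fixed threshold and that category is its best match. All function names, the toy data and the values of `threshold`, `lr`, `epochs` and `max_iter` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(docs):
    """Mean of sparse vectors; stands in for a category representation."""
    total = defaultdict(float)
    for doc in docs:
        for term, weight in doc.items():
            total[term] += weight
    return {term: weight / len(docs) for term, weight in total.items()}


def fit_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit p = sigmoid(a * x + b) by batch gradient descent (assumed model)."""
    a = b = 0.0
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            grad_a += (p - y) * x
            grad_b += p - y
        a -= lr * grad_a / len(xs)
        b -= lr * grad_b / len(xs)
    return a, b


def classify_target(source, target, threshold=0.5, max_iter=10):
    """source: list of (doc, label) pairs; target: list of unlabeled docs."""
    labels = sorted({label for _, label in source})
    # Initial category representations come from labeled source documents.
    profiles = {c: centroid([d for d, lab in source if lab == c]) for c in labels}

    # The relatedness model is built once, from source document/category pairs.
    xs, ys = [], []
    for doc, label in source:
        for c in labels:
            xs.append(cosine(doc, profiles[c]))
            ys.append(1.0 if label == c else 0.0)
    a, b = fit_logistic(xs, ys)

    def related(doc, c):
        # Uses whatever representation of category c is current.
        return 1.0 / (1.0 + math.exp(-(a * cosine(doc, profiles[c]) + b)))

    # Refinement: rebuild each representation from the target documents
    # currently judged related to it, until nothing changes.
    for _ in range(max_iter):
        refined = {}
        for c in labels:
            chosen = [d for d in target
                      if related(d, c) > threshold
                      and max(labels, key=lambda k: related(d, k)) == c]
            refined[c] = centroid(chosen) if chosen else profiles[c]
        if refined == profiles:
            break
        profiles = refined

    return [max(labels, key=lambda k: related(d, k)) for d in target]


# Toy data (hypothetical): the target domain only partly shares vocabulary.
source = [
    ({"ball": 1, "team": 1}, "sport"),
    ({"team": 1, "match": 1}, "sport"),
    ({"vote": 1, "law": 1}, "politics"),
    ({"law": 1, "senate": 1}, "politics"),
]
target = [
    {"team": 1, "goal": 1},
    {"goal": 1, "striker": 1},
    {"law": 1, "ballot": 1},
    {"ballot": 1, "mayor": 1},
]
predictions = classify_target(source, target)
```

In this toy setting the second and fourth target documents share no terms with any source document, so they can only be reached through representations rebuilt from target documents. The real method may differ in how documents are weighted, selected and combined.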
Paper Citation
in Harvard Style
Domeniconi G., Moro G., Pasolini R. and Sartori C. (2014). Cross-domain Text Classification through Iterative Refining of Target Categories Representations. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 31-42. DOI: 10.5220/0005069400310042
in Bibtex Style
@conference{kdir14,
author={Giacomo Domeniconi and Gianluca Moro and Roberto Pasolini and Claudio Sartori},
title={Cross-domain Text Classification through Iterative Refining of Target Categories Representations},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={31-42},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005069400310042},
isbn={978-989-758-048-2},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - Cross-domain Text Classification through Iterative Refining of Target Categories Representations
SN - 978-989-758-048-2
AU - Domeniconi G.
AU - Moro G.
AU - Pasolini R.
AU - Sartori C.
PY - 2014
SP - 31
EP - 42
DO - 10.5220/0005069400310042