Parsing and Maintaining Bibliographic References - Semi-supervised Learning of Conditional Random Fields with Constraints

Sebastian Lindner, Winfried Höhn

Abstract

This paper presents key components of our workflow for handling bibliographic information. We compare several approaches for parsing bibliographic references using conditional random fields (CRFs), concentrating on cases where only a few labeled training instances are available. To improve labeling results, prior knowledge about the bibliography domain is incorporated into CRF training through different constraint models. We show that our labeling approach achieves results comparable to, and in some cases better than, other state-of-the-art approaches. Finally, we show that for about half of our reference strings, a correlation between journal title, volume, and publication year can be used to identify the correct journal even when the journal title abbreviation is ambiguous.
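
The journal disambiguation idea mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each candidate journal publishes one volume per year, so that the publication year is roughly the year of the first volume plus the volume number minus one. The lookup table, the journal titles, the years, and the function name `resolve_journal` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical lookup table: abbreviation -> {full journal title: year of volume 1}.
# Titles and years are invented for illustration only.
CANDIDATES: Dict[str, Dict[str, int]] = {
    "J. Phys.": {
        "Journal of Physics (hypothetical A)": 1968,
        "Journal of Physiology (hypothetical B)": 1878,
    },
}


def resolve_journal(abbrev: str, volume: int, year: int, tolerance: int = 1) -> List[str]:
    """Return the candidate titles whose volume/year pattern fits the reference.

    Assumption (not from the paper): one volume per year, so the expected
    publication year of a volume is first_year + volume - 1.
    """
    matches = []
    for title, first_year in CANDIDATES.get(abbrev, {}).items():
        expected_year = first_year + volume - 1
        # Accept the candidate if the stated year is within the tolerance.
        if abs(expected_year - year) <= tolerance:
            matches.append(title)
    return matches
```

With a table like this, a reference with abbreviation "J. Phys.", volume 10, and year 1977 would keep only the first candidate, because the second one would have published volume 10 about ninety years earlier. A real system would have to estimate the volume-to-year relation from data rather than assume it.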



Paper Citation


in Harvard Style

Lindner S. and Höhn W. (2012). Parsing and Maintaining Bibliographic References - Semi-supervised Learning of Conditional Random Fields with Constraints. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 233-238. DOI: 10.5220/0004138602330238


in Bibtex Style

@conference{kdir12,
author={Sebastian Lindner and Winfried Höhn},
title={Parsing and Maintaining Bibliographic References - Semi-supervised Learning of Conditional Random Fields with Constraints},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={233-238},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004138602330238},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Parsing and Maintaining Bibliographic References - Semi-supervised Learning of Conditional Random Fields with Constraints
SN - 978-989-8565-29-7
AU - Lindner S.
AU - Höhn W.
PY - 2012
SP - 233
EP - 238
DO - 10.5220/0004138602330238