Using Conditional Random Fields with Constraints to Train Support Vector Machines - Locating and Parsing Bibliographic References

Sebastian Lindner

2013

Abstract

This paper shows how bibliographic references can be located in HTML documents and then separated into fields. First, we demonstrate how Conditional Random Fields (CRFs) with constraints and prior knowledge about the bibliographic domain can be used to split bibliographic references into fields, e.g. authors and title, when only a few labeled training instances are available. For this purpose, an algorithm for automatic keyword extraction and a unique set of features and constraints are introduced. Features and the output of this CRF for tagging bibliographic references, together with Part-of-Speech (POS) analysis and Named Entity Recognition (NER), are then used to find the bibliographic reference section in an article. To this end, the HTML document is first separated into blocks of consecutive inline elements. We then compare a machine learning approach using a Support Vector Machine (SVM) with one using a CRF for the reference locating process. In contrast to other reference locating approaches, our method can even cope with single reference entries in a document or with multiple reference sections. We show that our reference location process achieves very good results, while the reference tagging approach is able to compete with other state-of-the-art approaches and sometimes even outperforms them.
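
As an illustration of the block separation step mentioned in the abstract, the following is a minimal sketch that splits an HTML string into blocks of consecutive inline content. It is not the paper's implementation: the set of block-level tags, the class name `InlineBlockSplitter`, and the helper `split_into_blocks` are assumptions made for this example, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags; the paper's actual list is not stated here.
BLOCK_TAGS = {
    "html", "body", "div", "p", "ul", "ol", "li", "table", "tr", "td", "th",
    "h1", "h2", "h3", "h4", "h5", "h6", "br", "hr", "blockquote",
}


class InlineBlockSplitter(HTMLParser):
    """Collects runs of text that are not interrupted by a block-level tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished blocks, in document order
        self._current = []   # text pieces of the block being built

    def _flush(self):
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        # Inline tags such as <i> or <b> do not end the current block.
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text


def split_into_blocks(html):
    """Hypothetical helper: return the text blocks found in an HTML string."""
    parser = InlineBlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


if __name__ == "__main__":
    sample = (
        "<div><p>Lafferty, J. (2001). <i>Conditional random fields</i>. "
        "ICML.</p><p>Intro <b>text</b></p></div>"
    )
    for block in split_into_blocks(sample):
        print(block)
```

Each resulting block could then be passed to a classifier that decides whether it is a bibliographic reference. Treating blocks individually is one way such a method might handle single reference entries or several reference sections in one document.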


Paper Citation


in Harvard Style

Lindner S. (2013). Using Conditional Random Fields with Constraints to Train Support Vector Machines - Locating and Parsing Bibliographic References. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013) ISBN 978-989-8565-75-4, pages 28-36. DOI: 10.5220/0004546100280036


in Bibtex Style

@conference{kdir13,
author={Sebastian Lindner},
title={Using Conditional Random Fields with Constraints to Train Support Vector Machines - Locating and Parsing Bibliographic References},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013)},
year={2013},
pages={28-36},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004546100280036},
isbn={978-989-8565-75-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013)
TI - Using Conditional Random Fields with Constraints to Train Support Vector Machines - Locating and Parsing Bibliographic References
SN - 978-989-8565-75-4
AU - Lindner S.
PY - 2013
SP - 28
EP - 36
DO - 10.5220/0004546100280036