How to Extract Unit of Measure in Scientific Documents?

Soumia Lilia Berrahou, Patrice Buche, Juliette Dibie-Barthelemy, Mathieu Roche

2013

Abstract

A large amount of quantitative data related to experimental results is reported in scientific documents as free text. Each quantitative result is characterized by a numerical value, often followed by a unit of measure. Automatically extracting quantitative data is a painstaking process because units are written in many different ways within documents. In this paper, we focus on extracting and identifying unit variants in order to iteratively enrich the terminological part of an Ontological and Terminological Resource (OTR) and, ultimately, to enable the extraction of quantitative data. Unit extraction involves two main steps. Because we work on unstructured documents, units are buried in the surrounding text. In the first step, our method addresses the crucial and time-consuming task of locating units, using supervised learning methods. Once the units have been located in the text, the second step extracts and identifies candidate units in order to enrich the OTR. The extracted candidates are compared with units already known in the OTR, using a new string distance measure, to validate whether they are relevant variants. We report conclusive experiments with our two-step method on a set of more than 35,000 sentences.
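
The second step, comparing a candidate unit with units already known in the OTR, can be illustrated with a minimal sketch. The abstract does not define the new string distance measure, so the sketch below substitutes the classic optimal string alignment distance (a restricted Damerau-Levenshtein distance) as a stand-in. The function names, the normalization by the longer string, and the 0.4 threshold are assumptions made for illustration only; they are not taken from the paper.

```python
from typing import Iterable, Optional, Tuple


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Stand-in for the paper's own measure, which the abstract does not define.
    Counts insertions, deletions, substitutions and adjacent transpositions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" -> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def closest_known_unit(
    candidate: str,
    known_units: Iterable[str],
    max_normalized: float = 0.4,  # hypothetical threshold, not from the paper
) -> Optional[Tuple[str, float]]:
    """Return (known unit, normalized distance) for the nearest known unit,
    or None when no known unit is close enough to count as a variant."""
    best_unit, best_score = None, None
    for unit in known_units:
        longest = max(len(candidate), len(unit)) or 1
        score = osa_distance(candidate, unit) / longest
        if best_score is None or score < best_score:
            best_unit, best_score = unit, score
    if best_score is not None and best_score <= max_normalized:
        return best_unit, best_score
    return None


if __name__ == "__main__":
    known = ["mL/min", "kg", "mol/L"]  # made-up sample of OTR unit labels
    print(closest_known_unit("ml/min", known))
    print(closest_known_unit("pages", ["kg"]))
```

Under these assumptions, a candidate such as "ml/min" would be proposed as a variant of the known unit "mL/min", while an unrelated token such as "pages" would be rejected. The comparison is deliberately case-sensitive, since case can distinguish units.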


Paper Citation


in Harvard Style

Berrahou S., Buche P., Dibie-Barthelemy J. and Roche M. (2013). How to Extract Unit of Measure in Scientific Documents? In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: SSTM, (IC3K 2013), ISBN 978-989-8565-75-4, pages 249-256. DOI: 10.5220/0004666302490256


in Bibtex Style

@conference{sstm13,
author={Soumia Lilia Berrahou and Patrice Buche and Juliette Dibie-Barthelemy and Mathieu Roche},
title={How to Extract Unit of Measure in Scientific Documents?},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: SSTM, (IC3K 2013)},
year={2013},
pages={249-256},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004666302490256},
isbn={978-989-8565-75-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: SSTM, (IC3K 2013)
TI - How to Extract Unit of Measure in Scientific Documents?
SN - 978-989-8565-75-4
AU - Berrahou S.
AU - Buche P.
AU - Dibie-Barthelemy J.
AU - Roche M.
PY - 2013
SP - 249
EP - 256
DO - 10.5220/0004666302490256