USING THE STRUCTURAL CONTENT OF DOCUMENTS TO AUTOMATICALLY GENERATE QUALITY METADATA

Lars Fredrik Høimyr Edvardsen, Ingeborg Torvik Sølvberg, Trond Aalberg, Hallvard Trætteberg

Abstract

Giving search engines access to high-quality document metadata is crucial for efficient document retrieval on the Internet and on corporate intranets. Such metadata is currently sparse. This paper shows how the structural content of document files can be used for Automatic Metadata Generation (AMG): the approach works directly on the documents’ content (code) and makes it possible to combine AMG algorithms effectively for additional harvesting and extraction. As a result, AMG can generate metadata of high syntactic, semantic and pragmatic quality from data sources that are non-homogeneous in the visual characteristics and language of their intellectual content.


Paper Citation


in Harvard Style

Edvardsen L., Sølvberg I., Aalberg T. and Trætteberg H. (2009). USING THE STRUCTURAL CONTENT OF DOCUMENTS TO AUTOMATICALLY GENERATE QUALITY METADATA. In Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8111-81-4, pages 354-363. DOI: 10.5220/0001841003540363


in Bibtex Style

@conference{webist09,
author={Lars Fredrik Høimyr Edvardsen and Ingeborg Torvik Sølvberg and Trond Aalberg and Hallvard Trætteberg},
title={USING THE STRUCTURAL CONTENT OF DOCUMENTS TO AUTOMATICALLY GENERATE QUALITY METADATA},
booktitle={Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2009},
pages={354-363},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001841003540363},
isbn={978-989-8111-81-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - USING THE STRUCTURAL CONTENT OF DOCUMENTS TO AUTOMATICALLY GENERATE QUALITY METADATA
SN - 978-989-8111-81-4
AU - Edvardsen L.
AU - Sølvberg I.
AU - Aalberg T.
AU - Trætteberg H.
PY - 2009
SP - 354
EP - 363
DO - 10.5220/0001841003540363