Identifying Multidocument Relations

Erick Galani Maziero, Maria Lucía del Rosario Castro Jorge, Thiago Alexandre Salgueiro Pardo

Abstract

The digital world generates an ever-growing volume of information. Much of it is redundant, complementary, or contradictory, and it may be produced by several sources. Applications such as multidocument summarization and question answering must handle this information, and they need to identify relations among the texts in order to accomplish their tasks. In this paper, we first describe an effort to create a corpus of news texts annotated with multidocument relations from Cross-document Structure Theory (CST), and then present a machine learning experiment on the automatic identification of some of these relations. We show that our results for both tasks are satisfactory.


Paper Citation


in Harvard Style

Galani Maziero E., del Rosario Castro Jorge M. and Salgueiro Pardo T. (2010). Identifying Multidocument Relations. In Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2010) ISBN 978-989-8425-13-3, pages 60-69. DOI: 10.5220/0003028800600069


in Bibtex Style

@conference{nlpcs10,
author={Erick Galani Maziero and Maria Lucía del Rosario Castro Jorge and Thiago Alexandre Salgueiro Pardo},
title={Identifying Multidocument Relations},
booktitle={Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2010)},
year={2010},
pages={60-69},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003028800600069},
isbn={978-989-8425-13-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science - Volume 1: NLPCS, (ICEIS 2010)
TI - Identifying Multidocument Relations
SN - 978-989-8425-13-3
AU - Galani Maziero E.
AU - del Rosario Castro Jorge M.
AU - Salgueiro Pardo T.
PY - 2010
SP - 60
EP - 69
DO - 10.5220/0003028800600069