AUTOMATIC DIALOG ACT CORPUS CREATION FROM WEB PAGES

Pavel Král, Christophe Cerisara

Abstract

This work presents two complementary tools for creating textual corpora for linguistic research. The chosen application domain is automatic dialog act recognition, but the proposed tools may also be applied to any other research area concerned with dialog processing. The first tool captures relevant dialogs from freely available resources on the World Wide Web. Filtering and parsing of these web pages rely on a set of hand-crafted rules. A second set of rules is then applied to perform automatic segmentation and dialog act tagging. The second tool is then used as a post-processing step to manually check and correct tagging errors when needed. In this paper, both tools are presented, and the performance of automatic tagging is evaluated on a dialog corpus extracted from an online Czech journal. We show that reasonably good dialog act labeling accuracy can be achieved, which greatly reduces the cost of building such corpora.
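
As a rough illustration of rule-based segmentation and dialog act tagging, the sketch below splits a speaker turn into sentence-like units and labels each one from its final punctuation mark. The abstract does not list the actual rules, so the regular expression, the function names (`segment`, `tag`, `label_turn`) and the three labels are assumptions made for this example only; they are not the authors' rule set or label inventory.

```python
import re
from typing import List, Tuple

# Hypothetical segmentation rule: a unit ends at '.', '?' or '!'
# followed by whitespace. The paper's real rules are not given here.
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def segment(turn: str) -> List[str]:
    """Split one speaker turn into sentence-like units."""
    return [s.strip() for s in SENTENCE_END.split(turn) if s.strip()]


def tag(sentence: str) -> str:
    """Assign a coarse, illustrative dialog act label from surface cues."""
    if sentence.endswith("?"):
        return "question"
    if sentence.endswith("!"):
        return "exclamation"
    return "statement"


def label_turn(turn: str) -> List[Tuple[str, str]]:
    """Segment a turn and pair every unit with its label."""
    return [(s, tag(s)) for s in segment(turn)]


example = "Where were you yesterday? I was at home. Stop asking!"
for sentence, act in label_turn(example):
    print(f"{act}\t{sentence}")
```

Labels produced by rules this simple would still need the manual checking step described above, since punctuation alone cannot separate, for instance, a genuine question from a rhetorical one.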



Paper Citation


in Harvard Style

Král P. and Cerisara C. (2010). AUTOMATIC DIALOG ACT CORPUS CREATION FROM WEB PAGES. In Proceedings of the 12th International Conference on Enterprise Information Systems - Volume 5: ICEIS, ISBN 978-989-8425-08-9, pages 198-203. DOI: 10.5220/0003019501980203


in Bibtex Style

@conference{iceis10,
author={Pavel Král and Christophe Cerisara},
title={AUTOMATIC DIALOG ACT CORPUS CREATION FROM WEB PAGES},
booktitle={Proceedings of the 12th International Conference on Enterprise Information Systems - Volume 5: ICEIS},
year={2010},
pages={198-203},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003019501980203},
isbn={978-989-8425-08-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Conference on Enterprise Information Systems - Volume 5: ICEIS
TI - AUTOMATIC DIALOG ACT CORPUS CREATION FROM WEB PAGES
SN - 978-989-8425-08-9
AU - Král P.
AU - Cerisara C.
PY - 2010
SP - 198
EP - 203
DO - 10.5220/0003019501980203