Authors:
Pavel Král 1 and Christophe Cerisara 2
Affiliations:
1 University of West Bohemia, Czech Republic; 2 LORIA UMR 7503, France
Keyword(s):
Automatic labeling, Corpus, Dialog act, Internet.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Enterprise Information Systems; Human-Computer Interaction; Intelligent User Interfaces; Machine Perception: Vision, Speech, Other
Abstract:
This work presents two complementary tools dedicated to the creation of textual corpora for linguistic research. The chosen application domain is automatic dialog act recognition, but the proposed tools might also be applied to any other research area concerned with dialog processing. The first tool captures relevant dialogs from freely available resources on the World Wide Web. Filtering and parsing of these web pages are performed by a set of hand-crafted rules. A second set of rules is then applied to achieve automatic segmentation and dialog act tagging. The second tool is finally used as a post-processing step to manually check and correct tagging errors when needed. In this paper, both tools are presented, and the performance of automatic tagging is evaluated on a dialog corpus extracted from an online Czech journal. We show that reasonably good dialog act labeling accuracy may be achieved, hence greatly reducing the cost of building such corpora.