on the Web. Among these resources, many dialogs
in many different languages are available. Therefore,
considerable effort has been devoted to exploiting the
World Wide Web as a primary content source for cor-
pus building. The proposed work follows the same
strategy and also focuses on extracting textual online
content for the target application. Yet, online texts
are difficult to exploit because of the wide variability
of encoding and representation formats, which makes
the extraction process challenging. Note that in this
work, only texts are extracted, and the audio record-
ings are discarded for the time being. We have indeed
shown in previous work (Král et al., 2006), (Král
et al., 2007), (Král et al., 2008) that textual transcrip-
tions are the most informative features for dialog act
recognition, which justifies this choice as a first ap-
proximation.
The rest of the paper is organized as follows. The
next section presents a short review of corpus cre-
ation approaches. Section 3 describes the processing
of Web pages along with the proposed approach for
automatic segmentation and dialog act labeling. Sec-
tion 4 describes the jDALabeler tool for manual an-
notation and correction, while Section 5 evaluates the
whole process on building a Czech corpus. In the last
section, we discuss these results and propose some fu-
ture research directions.
2 SHORT REVIEW OF CORPORA
CREATION
Because of the virtually unlimited amount of tex-
tual content freely available on the World Wide Web,
many research works have tried to extract and exploit
useful information from the Web in recent years, in
order to create corpora for a variety of applications
and domains. In the following, we review only a few
of these works that are closely related to our applica-
tion or that illustrate specific and important benefits
of exploiting such resources.
Maeda et al. present in (Maeda et al., 2008) several
tools and systems for various corpus creation projects.
The following main issues are addressed:
• Data scouting: to browse Web pages and save
them in a database.
• Data selection: to choose the appropriate data sources.
• Data annotation: to associate the data with its cor-
responding labels.
• Speech transcription.
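The four stages above can be pictured as fields of a single annotation record. The sketch below is a hypothetical illustration of such a record; the field names are our own and do not reflect the actual schema of the tools described by Maeda et al.

```python
from dataclasses import dataclass, field

# Hypothetical record covering the four corpus creation stages;
# the schema is invented for illustration only.
@dataclass
class CorpusItem:
    url: str                     # data scouting: where the page was found
    selected: bool = False       # data selection: kept as a usable source?
    labels: list = field(default_factory=list)  # data annotation
    transcription: str = ""      # speech transcription, if audio exists

item = CorpusItem(url="http://example.org/dialog1")
item.selected = True
item.labels.append("statement")
```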
Most often, these processes are carried out manually,
which guarantees good quality but greatly increases
the overall cost. A major advantage of extracting in-
formation from Web pages is that textual resources
are often complemented with contextual information,
such as keywords, document summaries, related videos,
and so on. However, most of this information is stored
in different and non-standard formats, requiring ad-
vanced methods, such as automatic classifiers, to fil-
ter and “interpret” them. For instance, (Zhang et al.,
2006) exploit k-nearest-neighbor classifiers to collect
bilingual parallel corpora from the Web.
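As an illustration of such classifier-based filtering, the sketch below applies a tiny k-nearest-neighbor rule to decide whether a candidate page pair looks parallel. The two-dimensional features (sentence-length ratio, shared-numeral ratio) and the training points are invented for the example; they do not come from (Zhang et al., 2006).

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points (Euclidean distance)."""
    neighbors = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Invented 2-D features: (sentence-length ratio, shared-numeral ratio)
train = [
    ((0.95, 0.90), "parallel"),     ((1.05, 0.80), "parallel"),
    ((0.90, 0.85), "parallel"),     ((0.40, 0.10), "not_parallel"),
    ((2.10, 0.20), "not_parallel"), ((0.30, 0.00), "not_parallel"),
]

print(knn_predict(train, (1.0, 0.9)))   # a pair close to the "parallel" cluster
```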
The work described in (Sarmento et al., 2009) is
more closely related to our work; it also exploits fine-
tuned hand-crafted rules to classify sentences. How-
ever, it mainly differs in the chosen application (po-
litical opinion mining), the target language (Por-
tuguese), the textual sources (online newspapers),
and the manual rules. Many research efforts
in the domain of corpus creation are actually dedi-
cated to gathering and compiling available resources
in a given language, for which large enough corpora
may not exist yet (Pavel et al., 2006). The Web is par-
ticularly valuable in this respect, and several such
works thus focus on comparing and exploring the most
efficient ways to crawl the Web and retrieve relevant
information (Botha and Barnard, 2005).
3 WEB PAGE PROCESSING AND
AUTOMATIC CORPUS
LABELING
The MOIS (Monitoring Internet Resources) software
is a specific Web crawler designed to process Web
pages with dialogs in order to build new corpora. The
algorithm for processing a website is the following:
1. Start from a given URL.
2. Detect, clean and save all dialogs in the Web page
corresponding to this URL.
3. Parse the dialogs: segment into sentences and an-
notate all sentences with dialog act tags.
4. Store all hyperlinks in this Web page into a list.
5. Choose (and remove from the list) one of the saved
URLs and iterate from step 1 until the depth within
the website exceeds n.
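The loop above can be sketched as follows. Since MOIS's actual page retrieval and rule-based extraction are not public, the site is mocked here as an in-memory graph and dialog extraction is a stub; only the overall control flow (save dialogs, collect links, iterate up to depth n) mirrors the steps listed.

```python
# Sketch of a MOIS-style crawl loop; the site is mocked as an
# in-memory graph:  url -> (dialog text or None, outgoing links).
SITE = {
    "/start": ("A: Hello?\nB: Hi!", ["/page1", "/page2"]),
    "/page1": (None, ["/page2"]),
    "/page2": ("A: Bye.\nB: Bye!", []),
}

def crawl(start_url, max_depth):
    corpus, to_visit, seen = [], [(start_url, 0)], set()
    while to_visit:
        url, depth = to_visit.pop()            # step 5: pick a saved URL
        if url in seen or depth > max_depth:   # stop once depth exceeds n
            continue
        seen.add(url)
        dialog, links = SITE[url]              # step 1: visit the page
        if dialog is not None:
            corpus.append(dialog)              # steps 2-3: save the dialog
        to_visit.extend((l, depth + 1) for l in links)  # step 4: store links
    return corpus

print(crawl("/start", max_depth=2))
```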
3.1 Dialog Detection
During step 2, dialog detection and extraction are
performed using hand-crafted rules. These rules ex-
ploit several features, such as:
• Information about the speakers (e.g. speaker iden-
tity).
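As a toy illustration of a rule based on speaker identity, the pattern below detects transcript-style lines that begin with a speaker name. The actual MOIS rules are hand-crafted and more elaborate; this simplified regular expression is only an assumed example.

```python
import re

# Hypothetical simplified rule: a dialog turn looks like
# "SpeakerName: utterance".  Real extraction rules are richer.
TURN = re.compile(r"^\s*([A-Z][\w .']{0,30}):\s+(.+)$")

def detect_turns(lines):
    """Return (speaker, utterance) pairs for lines matching the rule."""
    out = []
    for line in lines:
        m = TURN.match(line)
        if m:
            out.append((m.group(1), m.group(2)))
    return out

page = ["Interviewer: How are you?",
        "Some narrative text without a speaker.",
        "Guest: Fine, thanks."]
print(detect_turns(page))
```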
AUTOMATIC DIALOG ACT CORPUS CREATION FROM WEB PAGES