parts (Hearst, 1997), (Heinonen, 1998). To penalize
deviations from the expected segment length, several
methods use the notion of a "length model"
(Heinonen, 1998), (Ponte and Croft, 1997). Dynamic
programming is often used to calculate the
globally minimal segmentation cost (Heinonen,
1998), (Reynar, 1994), (Xiang and Hongyuan,
2003), (Kehagias et al., 2004), (Qi et al., 2008).
Current approaches involve the improvement of the
dotplotting technique (Yen et al., 2005), the
improvement of Latent Semantic Analysis (Bestgen,
2006) and the improvement of Hearst’s TextTiling
method (Hearst, 1997) presented by (Kern and
Granitzer, 2009).
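As an illustration of how dynamic programming and a length model can be combined, the following minimal Python sketch finds the minimum-cost segmentation of a sequence of sentences. It is not the formulation of any of the works cited above: the adjacent-sentence Jaccard cost, the quadratic length penalty, and the names `jaccard`, `segment`, `expected_len` and `penalty` are all assumptions made for this example.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity between two sentences (0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def segment(sentences: List[Set[str]], expected_len: float,
            penalty: float = 1.0) -> List[int]:
    """Return the exclusive end index of each segment in the cheapest segmentation.

    expected_len must be positive; both cost terms are illustrative assumptions.
    """
    n = len(sentences)
    # dissim[k] = dissimilarity between sentence k and sentence k + 1.
    dissim = [1.0 - jaccard(sentences[k], sentences[k + 1]) for k in range(n - 1)]

    def seg_cost(i: int, j: int) -> float:
        # Cost of placing sentences i..j-1 in one segment: internal
        # dissimilarity plus a penalty for deviating from expected_len.
        internal = sum(dissim[i:j - 1])
        length_term = penalty * ((j - i - expected_len) / expected_len) ** 2
        return internal + length_term

    # best[j] = minimal cost of segmenting the first j sentences.
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            cost = best[i] + seg_cost(i, j)
            if cost < best[j]:
                best[j], back[j] = cost, i

    # Follow the back-pointers from the end to recover the boundaries.
    boundaries = []
    j = n
    while j > 0:
        boundaries.append(j)
        j = back[j]
    return boundaries[::-1]


sentences = [
    {"athlete", "race"},
    {"race", "athlete"},
    {"bank", "loan"},
    {"loan", "bank"},
]
print(segment(sentences, expected_len=2))  # [2, 4]
```

In this toy input the first two sentences share all their words and so do the last two, so the cheapest solution places a single boundary after the second sentence. Raising `penalty` pushes the result towards segments of exactly `expected_len` sentences, regardless of content.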
Information extraction, from a different point of
view, aims to locate within a text passage domain-
specific and pre-specified facts (e.g., in a passage
about athletics, facts about the athlete participating
in a 100m event, such as name, nationality,
performance, as well as facts about the specific
event, like the event name). More specifically,
information extraction involves, among other tasks,
extracting from texts: (a) Entities: textual fragments
of particular interest, such as persons, places,
organizations, dates, etc.; (b) Mentions: all
lexicalisations of an entity in a text. For example,
the name of a particular person
can be mentioned in different ways inside a single
document, such as “Lebedeva”, “Tatiana Lebedeva”,
or “T. Lebedeva”. The following pre-processing
steps are applied in order to perform information
extraction: (a) Named Entity Recognition, where
entity mentions are recognized and classified into
the appropriate types for the thematic domain in
question; (b)
Co-reference, where all the mentions that represent
the same entity are identified and grouped together
according to the entity they refer to.
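The grouping step can be illustrated with a deliberately naive Python sketch. It assumes that two mentions refer to the same entity whenever their final tokens match (ignoring case), which is enough for the "Lebedeva" example above but is far from a real co-reference resolver; the function name `group_mentions` is hypothetical.

```python
from typing import Dict, List


def group_mentions(mentions: List[str]) -> List[List[str]]:
    """Group mentions that share the same final token, e.g. a surname.

    This is an assumed heuristic for illustration only: it will wrongly
    merge different people who share a surname and will miss pronouns.
    """
    groups: Dict[str, List[str]] = {}
    for mention in mentions:
        tokens = mention.split()
        if not tokens:
            continue  # skip empty strings
        key = tokens[-1].lower()
        groups.setdefault(key, []).append(mention)
    # Dictionaries preserve insertion order, so groups appear in the
    # order in which their first mention was seen.
    return list(groups.values())


mentions = ["Tatiana Lebedeva", "Lebedeva", "T. Lebedeva", "Athens"]
print(group_mentions(mentions))
# [['Tatiana Lebedeva', 'Lebedeva', 'T. Lebedeva'], ['Athens']]
```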
Co-reference resolution additionally includes
the step of anaphora resolution. The term anaphora
denotes the phenomenon of referring to an entity
already mentioned in a text, most often with the help
of a pronoun or a different name. Co-reference
basically involves the following steps: (a)
pronominal co-reference, i.e., finding the proper
antecedent for personal pronouns, possessive
adjectives, possessive pronouns, reflexive pronouns
and the pronouns "this" and "that"; (b) identification
of cases where both the anaphor and the antecedent
refer to identical sets or types. This identification
requires some world knowledge or domain-specific
knowledge. It also includes cases such as reference
to synonyms or the case where the anaphor matches
the antecedent exactly or is a substring of it; (c)
ordinal anaphora for cardinal numbers and adjectives
such as "former" and "latter".
The importance of text segmentation and
information extraction is apparent in a number of
applications, such as noun phrase chunking, tutorial
dialogue segmentation, focused crawling, text
summarization, semantic segmentation and web
content mining. In (Fragkou, 2009) the potential use
of text segmentation in the information extraction
process was examined. In this paper the reverse
problem is examined, i.e., the use of information
extraction techniques in the text segmentation
process. Those techniques are applied to a
benchmark used for text segmentation, resulting in
the creation of an "annotated" corpus. Evaluation
was performed using three well-known segmentation
algorithms (Choi et al., 2001), (Kehagias et al.,
2004) and (Utiyama and Isahara, 2001), applied to
both the original and the "annotated" corpus.
Similar work was presented in (Sitbon and
Bellot, 2005). The authors used two corpora. The
first one was a manually built French news corpus
containing four series of 100 documents, where
each document was composed of ten segments
extracted from the newspaper "Le Monde". The
second one covered a single topic (sport). In
each of those corpora, they performed named entity
recognition using three types of named entities:
person name, location, and organization. The authors
report using anaphors but provide no further details.
They used named entity instances as components of
lexical chains to perform text segmentation. Their
results showed that the use of named entities does
not improve segmentation accuracy.
3 METHOD
Existing text segmentation algorithms exploit a
variety of statistical word co-occurrence
techniques in order to calculate the homogeneity
between segments, where each segment refers to a
single topic. However, they do not exploit the
importance that several words may have in a specific
context. Examples of such words are person names,
locations, dates, groups of names, scientific terms, etc.
The importance of those terms is further diminished
by the application of pre-processing techniques,
i.e., stop-list removal and stemming, to words such
as pronouns or adjectives. We aim to examine
whether the identification of such words can be
beneficial for the segmentation task. This
identification requires the application of named
entity recognition and co-reference resolution; thus,
ICAART 2011 - 3rd International Conference on Agents and Artificial Intelligence