significant improvement for documents written in
Portuguese and a minor improvement for Czech, as
representatives of morphologically rich languages,
regarding precision results.
Additionally, we extend the preliminary
discussion started in (Teixeira, Lopes, & Ribeiro,
2011), where some of the metrics used in the current
work were presented. To achieve our aims, we
compare the results obtained with four basic metrics
(Tf-Idf, Phi-square, Mutual Information and Relative
Variance) and with metrics derived from them, taking
into account the median character length of the words
composing each term and giving specific attention to
the extremities of multi-words and of words (where
the left and right extremities of a single word are
considered to be identical to the word itself). This led
to a first experiment in which we compare 12 metrics
(3 variants of each of the 4 basic metrics). In a second
experiment, we adopted a different document
representation, based on word prefixes of 5
characters, in order to better handle morphologically
rich languages. As it would be senseless to evaluate
the relevance of the prefixes themselves, it became
necessary to project (bubble) prefix relevance onto
words and multi-words.
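The prefix representation and the projection of prefix relevance back onto full words can be sketched as follows. This is a minimal illustration using Tf-Idf as the relevance score; the function names and scoring details are our own simplification, not the exact implementation used in the experiments:

```python
from collections import Counter
from math import log

def prefix(word, n=5):
    # Represent each word by its first n characters (n = 5 in the experiments).
    return word[:n]

def tfidf_over_prefixes(docs, n=5):
    # docs: list of tokenized documents (lists of lowercase words).
    # Returns, per document, a Tf-Idf score for each 5-character prefix.
    prefixed = [[prefix(w, n) for w in doc] for doc in docs]
    df = Counter()                       # document frequency of each prefix
    for doc in prefixed:
        df.update(set(doc))
    n_docs = len(docs)
    scores = []
    for doc in prefixed:
        tf = Counter(doc)
        total = len(doc)
        scores.append({p: (c / total) * log(n_docs / df[p])
                       for p, c in tf.items()})
    return scores

def bubble_to_words(doc, prefix_scores, n=5):
    # Project ("bubble") prefix relevance back onto the full words:
    # each word simply inherits the score of its own prefix.
    return {w: prefix_scores.get(prefix(w, n), 0.0) for w in doc}
```

Note that inflected forms sharing the same 5-character prefix (e.g. "keyword" and "keywords") receive the same relevance, which is precisely what makes the representation useful for morphologically rich languages.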
All the experimental results were manually
evaluated, and the agreement between evaluators was
assessed using k-Statistics.
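For reference, the simplest member of that family, Cohen's kappa for two evaluators, can be computed as follows (a sketch; the paper does not state which kappa variant was applied):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    # labels_a, labels_b: the two evaluators' judgements, aligned item by item.
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items on which the evaluators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each evaluator's marginal label distribution.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    # Kappa corrects observed agreement for agreement expected by chance.
    return (p_o - p_e) / (1 - p_e)
```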
This paper is structured as follows: related work
is summarized in section 2; our system, the data and
the experimental procedures used are described in
section 3; the metrics used are defined in section 4;
results obtained are shown in section 5; conclusions
and future work are discussed in section 6.
2 RELATED WORK
In the area of document classification, it is necessary
to select the features that will later be used for
training new classifiers and for classifying new
documents. This feature selection task is somewhat
related to the extraction of key terms addressed in this
paper. (Sebastiani, 2002) gives a rather complete
overview of the main metrics used for feature
selection in document classification and clustering.
As for the extraction of multi-words and
collocations, which we also need to extract, we just
mention the work of (Silva & Lopes, 1999), which
uses no linguistic knowledge, and the work of
(Jacquemin, 2001), which requires linguistic knowledge.
In the area of keyword and key multi-word
extraction, (Hulth, 2003), (Ngomo, 2008),
(Martínez-Fernández, García-Serrano, Martínez, &
Villena, 2004), (Cigarrán, Peñas, Gonzalo, &
Verdejo, 2005) and (Liu, Pennell, Liu, & Liu, 2009)
address the extraction of keywords in English.
Moreover, those authors use language-dependent
tools (stop-word removal, lemmatization, part-of-
speech tagging and syntactic pattern recognition) for
extracting noun phrases. Being language
independent, our approach clearly diverges from
theirs. Approaches dealing with the extraction of
key-phrases (which are, according to the authors,
“short phrases that indicate the main topic of a
document”) include the work of (Katja, Manos,
Edgar, & Maarten de, 2009), where the Tf-Idf metric
is used together with several language-dependent
tools. In
(Mihalcea & Tarau, 2004), a graph-based ranking
model for text processing is used. The authors follow
a two-phase approach to the extraction task: first they
select key-phrases representative of a given text;
then they extract its most “important” sentences, to
be used for summarizing the document content.
In (Peter, 2000), the author tackles the problem of
automatically extracting key-phrases from text as a
supervised learning task, treating each document as a
set of phrases that his classifier learns to identify as
positive or negative examples of key-phrases.
(Lemnitzer & Monachesi, 2008) deal with eight
different languages, using statistical metrics aided by
linguistic processing to extract both key-phrases and
keywords. Also dealing with more than one
language, (Silva & Lopes, 2009) extract key multi-
words using purely statistical measures. In (Silva &
Lopes, 2010), the statistical extraction of keywords is
also tackled, but a predefined ratio of keywords to
key multi-words is assumed per document, thus
jeopardizing statistical purity.
(Matsuo & Ishizuka, 2004) present a keyword
extraction algorithm that applies to isolated
documents rather than to documents in a collection.
They extract frequent terms and the set of co-
occurrences between each term and those frequent terms.
In summary, the approach followed in our work
is unsupervised and language independent, and it
extracts keywords or multi-words solely on the basis
of their ranking values, obtained by applying the 20
metrics announced and explained below in section 4.
3 SYSTEM, DATA AND
EXPERIMENTS
Our system is made of three distinct modules. The
first module is responsible for extracting multi-words,
based on (Silva, Dias, Guilloré, & Lopes, 1999) and
using the extractor of (Gomes, 2009). A Suffix
ICAART 2012 - International Conference on Agents and Artificial Intelligence