Lexical analysis tasks are: management of
numbers, punctuation, singulars, special words
and proper nouns. This stage produces
candidate terms that are further checked and
retained if they are not in a stoplist.
• Stoplist Word Removal: A stoplist is a list of
words that are most frequent in a text corpus
and do not discriminate the contents of a
message, such as prepositions, pronouns and
conjunctions.
• Stemming: Stemming is the process of suffix
removal to generate word stems. Several
different methods for automatic stemming are
described in (Frakes, 92). One of them, the
Porter stemming algorithm, is the most common
and was used in the system.
• Ranking Candidate Keywords: a candidate
keyword had to appear in at least three
documents and in no more than 33% of all
documents. Only the candidate keywords are
used in the following stages.
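The stoplist and document-frequency criteria above can be illustrated with a minimal sketch. It assumes each message has already been lexically analysed and stemmed into a list of tokens. The function name, the parameter names and the tiny stoplist are hypothetical and are not taken from the system described here.

```python
from collections import Counter

# Hypothetical stoplist; a real one would be far larger.
STOPLIST = {"the", "a", "of", "and", "in", "to", "is"}


def candidate_keywords(documents, min_docs=3, max_fraction=0.33):
    """Return the terms that appear in at least `min_docs` documents
    and in no more than `max_fraction` of all documents.

    `documents` is a list of token lists (assumed already stemmed).
    """
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document and skip stoplist words.
        doc_freq.update(set(tokens) - STOPLIST)

    upper_limit = max_fraction * len(documents)
    return {
        term
        for term, freq in doc_freq.items()
        if min_docs <= freq <= upper_limit
    }
```

Note that with these two thresholds a collection needs at least ten messages before any term can satisfy both conditions at once.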
2.2 Conceptual Representation
To work with the messages in an abstract way, a
logical representation of them is essential. The
vector space model and its extensions (Pasi, 2002)
have traditionally been used. FIS-CRM (Olivas, 2003)
can be considered an extension of this traditional
model that represents, within the vector attached to
each document, the concepts underlying the words it
contains.
FIS-CRM is therefore based on two main
points:
a) If a word appears in a document, its synonyms
that represent the same concept underlie it.
b) If a word appears in a document, the words that
represent a more general concept underlie it.
The fundamental basis of FIS-CRM is to “share”
the occurrences of a contained word among the
fuzzy synonyms that represent the same concept,
and to “give” a fuzzy weight to the words that
represent a more general concept than the contained
one. To achieve this, documents must first be
represented by their base weight vectors (based on
the occurrences of the contained words) and
afterwards, a weight readjustment process is made to
obtain a new vector (based on concept occurrences).
In this way, a word may have a fuzzy weight in the
new vector even if it is not contained in the
document, as long as the concept it refers to
underlies the document.
To carry out the readjustment, the synonymy and
generality fuzzy interrelations have to be taken into
account. They are obtained, respectively, from a fuzzy
dictionary of synonyms (Fernandez-Lanza, 2001)
and from an ontology (Kiryakov, 1999). The
process to be used in the conceptual representation
of already pre-processed e-mail messages on a
distribution list consists of the following steps:
1) Indexing of all the terms obtained in the
pre-processing stage.
2) Building synonymy and ontology matrices by
storing the synonymy and generality degrees of
each pair of words in the index.
3) Representation of the messages using the classic
vector space model.
4) Readjustment of the vector weights using the
FIS-CRM formulae group.
a) The vector readjustment made using the
synonymy interrelation is hindered by the
fact that there are many polysemous words
(words with several meanings).
b) The vector readjustment made using the
generality interrelation is linear and
proportional to the generality degree
between term A and term B.
5) Generation of the similarity matrix, which
stores the degree of similarity between every pair
of messages in the collection. This matrix is
the input to the later clustering process.
6) Storage of the essential information as meta-
data to allow the management of later incoming
messages.
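Steps 3 to 5 can be illustrated with a simplified sketch. It is not the FIS-CRM formulae group, which is not reproduced in this section. It only assumes that the generality readjustment adds to the more general term a weight proportional to the generality degree (as stated in step 4b) and that message similarity is measured with the cosine of the weight vectors. The additive update, the use of cosine similarity and all names below are assumptions made for illustration.

```python
import math


def readjust_generality(weights, generality):
    """Give each more general term a weight proportional to the
    generality degree (assumed additive update, not the FIS-CRM formula).

    weights:    dict term -> base weight of one message
    generality: dict (specific_term, general_term) -> degree in [0, 1]
    """
    adjusted = dict(weights)
    for (specific, general), degree in generality.items():
        base = weights.get(specific, 0.0)
        if base > 0.0:
            adjusted[general] = adjusted.get(general, 0.0) + degree * base
    return adjusted


def cosine(u, v):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise similarity of all messages, as a list of rows."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


# Two toy messages that share no word but share a concept.
messages = [{"dog": 2.0}, {"animal": 1.0}]
degrees = {("dog", "animal"): 0.8}  # hypothetical generality degree

before = similarity_matrix(messages)
after = similarity_matrix(
    [readjust_generality(m, degrees) for m in messages]
)
```

In this toy case the two messages have zero similarity under the classic vector space model, and a positive similarity once "animal" receives a weight in the first message's vector.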
2.3 Messages Clustering
The clustering process splits the collection of
messages into a small number of groups made up of
messages with sufficient conceptual similarity. Each
group contains one or more relevant terms that
distinguish it from the rest.
In this work, a hierarchical fuzzy clustering
approach is presented. The clustering procedure is
implemented by two connected and adapted
algorithms. It uses a fuzzy hierarchical clustering
algorithm to determine an initial clustering, which is
then refined using the SISC (King-Ip, 2001)
clustering algorithm used in the FISS meta-searcher
structure.
This algorithm is characterized by creating an
initial number (automatically calculated) of centroid
clusters, followed by an iterative process that
includes each document in the clusters whose
average similarity is higher than the similarity
threshold (automatically calculated, or specified
by the user if desired). The algorithm also considers
merging clusters and removing documents from
clusters when their average similarity falls
below the threshold. In order to get a hierarchical
structure, big clusters and the bag cluster (formed by