AUTOMATIC EMAIL CLASSIFICATION USING USER

PREFERENCE ONTOLOGY

Niladri Chatterjee

Department of Mathematics, Indian Institute of Technology, Delhi, India

Saroj Kaushik

Department of Computer Science and Engineering, Indian Institute of Technology, Delhi, India

Smit Rastogi, Varun Dua

Department of Mathematics, Indian Institute of Technology, Delhi, India

Keywords:

Email classiﬁcation, Ontology, Automatic email categorization, Inference engine, Feature weighing.

Abstract:

In this work we have extended and implemented an ontology based approach for email classiﬁcation based

on user characteristics proposed by Kim et al.(2007). The approach focuses on ﬁnding relationships between

user interests and their responses to emails. Rules and Ontology are created using the data and metadata of

user characteristics, their preferences and responses to emails. Rules and ontology are then used to predict the

response of a user to a new email. In Kim et al. (2007) approach, labels to emails were provided manually by

a human expert. We have endeavored to remove the human intervention by developing an Automated Email

Categorizer to provide label to an email based on its contents. We have also proposed a new term weighing

method for emails to incorporate prominence of subject terms. Finally, we have integrated and tested the

Ontology Based Classiﬁer in conjunction with Email Categorizer where the former effectively uses the label

provided by latter to classify an email based on user preferences.

1 INTRODUCTION

Email has emerged as an efﬁcient and popular com-

munication mechanism in recent times. But one of

the major problems is the huge number of unso-

licited/spam mails that ﬁll in user’s inbox regularly.

Consequently, email management has gained atten-

tion and has become an important problem for indi-

viduals and organizations.

One major aspect of email management is clas-

siﬁcation of emails. In this context the terms email

classiﬁcation and email ﬁltering are often used inter-

changeably. Filtering may be considered as a sub-

class of classiﬁcation with just two classes (Itskevitch,

2001); while one of its applications being spam ﬁlter-

ing.

Several Email-ﬁltering techniques have been de-

veloped so far. The successful ones are either rule

based or statistics based (Zhang and Yao, 2003). Re-

cently some attempts have been made towards ontolo-

gy based email-ﬁltering that allows machine-

understandable semantics of data (Youn and Mcleod,

2006).

(Brewer et al., 2006) have suggested a system

that uses ontology to discover relationships between

tokens in an email thus utilizing semantics of an

email as an additional parameter for classiﬁcation. A

framework for email ﬁltering using ontology, based

on frequency count of words in an email, is proposed

in (Youn and Mcleod, 2007). The counts of various

words occurring in spam/non-spam training emails

were used to create a decision tree for classiﬁcation.

The decision tree formed was represented in the form

of ontology using RDF. A new email is then classiﬁed

using this ontology.

Several supervised learning methods based on

statistical analysis have been applied to the email

ﬁltering task (Zhang et al., 2004). However, the

major drawback with all the statistical approaches is

that they do not incorporate the interests of a speciﬁc

165

Chatterjee N., Kaushik S., Rastogi S. and Dua V..

AUTOMATIC EMAIL CLASSIFICATION USING USER PREFERENCE ONTOLOGY.

DOI: 10.5220/0003061501650170

In Proceedings of the International Conference on Knowledge Engineering and Ontology Development (KEOD-2010), pages 165-170

ISBN: 978-989-8425-29-4

 2010 SCITEPRESS (Science and Technology Publications, Lda.)

user towards particular types of mails. Content

based classiﬁers cannot incorporate preferences and

contexts to take a particular feature in an email as

a parameter for classiﬁcation. Users behaviours

to emails vary according to the different personal

preferences. Therefore it is meaningful to provide

user-oriented classiﬁer based on the preferences.

An email ﬁltering scheme based on the per-

sonalized ontology to classify an email as spam or

non-spam, has been proposed in (Youn and Mcloed,

2009)). They created a personalized ontology by

combining a user proﬁle ontology and a structured

taxonomy. The user proﬁle ontology creates a black-

list of contacts and topic words while the structured

taxonomy is used as a global ontology ﬁlter.

An ontology based email-ﬁlter incorporating the

preferences of a user has been proposed by (Kim

et al., 2007). A user preference ontology is built

to formally represent the important concepts and

rules derived from a data mining process. Then an

inference engine is used that utilizes the knowledge

to predict the user’s action on new incoming emails.

This approach utilizes a label given manually to each

email, based on its content. The labels are from a

predeﬁned set of categories, and represent the content

of the email in the ontology and the data mining

process. The major drawback of this approach is that

the label has to be provided manually by a human

expert. However, to read and provide a label to an

email is a very tedious task.

In this paper we have extended the approach

suggested by (Kim et al., 2007) by proposing an

improvement in terms of automatically providing a

label to each email through a text categorizer. We

have built an email-categorizer which when given an

email as input, outputs a label for that email. The

ontology based classiﬁer uses this label along with

the user preferences and characteristics to predict

the response of the user for the email. For the text

categorizer, we have also proposed a new term

weighing method speciﬁcally applicable to emails

which enhances the weights of terms depending

on their position in the email. However, our major

contribution has been in terms of the categorizer

and usage of its result in classiﬁcation process,

along with some improvements in the classiﬁcation

process (Table 2 and section 4.3) as compared to that

originally suggested by (Kim et al., 2007).

The paper is organized as follows. In Section 2 we

describe the overall system design. Section 3 explains

the implementation of the scheme including essential

sub-components, such as processing of emails, data

collection, rule generation, ontology construction

and ﬁnal classiﬁcation procedure. Section 4 provides

analysis of the results and comparison with the

original system. Finally, Section 5 concludes this

paper.

2 OVERALL SYSTEM DESIGN

We have assumed that broadly users might have two

responses to an email: Interested or Not-interested.

The proposed architecture for the system is shown in

Figure 1.

Figure 1: Architecture of the proposed system.

The system has two main components: Automatic

Email Categorizer and Ontology Based Classiﬁer. To

predict the response of a user for any email, ﬁrst the

label for that email is generated by the Categorizer

based on email’s contents. The Classiﬁer takes that

label along with the preferences of the user as input

to predict the response of that user to that email. The

Categorizer has been trained and tested independently

for various parameters, the details of which are pre-

sented in section 3.1. The architecture of these two

subsystems is explained in the following sub-sections.

2.1 Automatics Email Categorizer

Architecture of the Automatic Email Categorization

(AEC) is shown in Fig. 2. AEC assigns label to a new

email based on its contents. Each component of the

classiﬁer is brieﬂy explained below.

Figure 2: Architecture of Automatic Email Categorization

System.

Database consists of Ken Langs 20newsgroups

dataset (Rennie, 2010) which contains approximately

KEOD 2010 - International Conference on Knowledge Engineering and Ontology Development

166

Figure 3: Architecture of the Ontology based Classiﬁcation System.

20000 messages taken from 20 different netnews

newsgroups partitioned (nearly) evenly across 20 cat-

egories.

Initially, the AEC system is trained using Nave

Bayesian Classiﬁer with labeled emails which are

represented by feature vectors containing important

terms. Once the system is trained, a label to a new

unlabeled email is predicted.

2.2 Ontology based Email Classiﬁer

The ontology based classiﬁer uses the label of new

email provided by AEC system along with user pref-

erences and ﬁnally classify whether the user is inter-

ested in this mail or not. The architecture of the ontol-

ogy based classiﬁer is shown in Fig. 3. The ontology

is constructed (using OWL-DL) to represent the meta-

data involving user characteristics, email category and

response of a user towards an email.

A database of user characteristics (gender, age

etc), their preferences and their responses to sample

emails (interested or not-interested) is created through

a survey from different users. In Fig. 3, User At-

tributes represent the characteristics and preferences

of various users. Email corresponds to the email for

which a Response from the user has been sought for

training the system. The details are given in Section

3. After building the database, we mine it using the

WEKA tool to ﬁnd correlations between the prefer-

ences of users and their responses to emails and if-

then Rules are generate. The steps involved in rule

generation are explained in detail in section 3.2.

The rules and the ontology are loaded in JENA

inference engine (a Java framework for building Se-

mantic Web applications) and the system predicts re-

sponse to a new email for the user. An accuracy of

the system is calculated by comparing the predicted

results of test emails with the actual responses. Sec-

tion 4 has the details.

3 IMPLEMENTATION DETAILS

In this section we describe brieﬂy the implementa-

tion of both the components namely, Automatic Email

Categorizer and Ontology Based Classiﬁer with Rule

generation.

3.1 Automatics Email Categorizer

We have used supervised learning approach for build-

ing proposed categorizer where we have utilized the

structure of emails to propose a new method for term

weighing which works visibly better for emails as

compared to other traditional term weighing methods.

As a result our proposed method outperforms the tra-

ditional term weighing methods (TF, TF-IDF).

3.1.1 Feature Selection

In our system we used only the contents of subject

and body of the documents from 20newsgroup. These

contain all the necessary informative terms. Non-

contextual/optional words are removed using a stop-

list. Stemming has been performed using Porter’s al-

gorithm.

To further reduce the size of feature space, we

used several feature selection methods and analyzed

their performances. We used the term frequency (TF)

and the document frequency (DF) feature selection

methods both for retaining the most important fea-

tures. We have also used Mutual Information (MI)

and Chi-square feature selection methods. MI and

Chi-square methods, although computationally ex-

pensive, yield signiﬁcantly better performances. As

observed from Table 1, the top ﬁve terms selected by

Chi-square and MI methods are more relevant to win-

dows category than the terms selected by DF and TF.

AUTOMATIC EMAIL CLASSIFICATION USING USER PREFERENCE ONTOLOGY

167

Figure 4: Structure of the ontology.

3.1.2 Weights of Features

Weight of a term t in document d is calculated by the

following formula (Eq. (1)).

= (n

+ An

)log(D/d f

× (t f

)/(t f

/T ))

(1)

Here, n

is the number of occurrences of the term

in body, n

is the number of occurrences of the term

in the subject, D denotes the number of emails in the

training set, d f

is the document frequency of the term

t. t f

is the number of occurrences of term t in doc-

ument d and T

is the total count of all the terms in

document d. Similarly t f

is the number of occur-

rences of term t in the entire training set and T is the

total count of all the terms in training set. ‘A denotes

the adjustment value according to the importance of

the subject terms in the email.

We estimate the adjustment value by experiments.

Experiments show that the proposed term weighing

method signiﬁcantly enhances the performance of the

categorizer.

Table 1: Top 5 features of different categories using dif-

ferent feature selection methods in category comp.os.ms-

windows.misc.

Chi-sq MI DF TF

nt directori m window

app load microsoft File

setup product support Do

mous print univers Re

desktop font read Write

3.2 Ontology based Email Classiﬁcation

We merged some of the categories, in the 20news-

group dataset, that were similar to each other and re-

duced 20 categories to 8 categories namely, Sports,

Automobiles, Religion, Advertisements, Computers,

Politics, Science, Medicine. We selected 50 emails

belonging to these categories. User attributes and

preferences, along with their responses to these 50 se-

lected emails from 30 users were also collected. We

created a user proﬁle format to represent the user in-

terests and their responses to emails in binary. Fig.

4 shows the structure of ontology that captures the

metadata of user preferences and their responses.

Email Label is the label consists of Interested or

Not-Interested. The rest of the attributes describe

the interests of a user towards a particular category.

Nodes in the ontology are self explanatory.

A decision tree was trained to discover the as-

sociation rules between various groups of users and

their responses by the sample email data. We used

WEKA API for the implementation of C4.5 decision

tree. The Rules generated were of following form: If

Email Label = ‘sports & sports = ‘False’ & gender

= ‘Male & computers = ‘True then ‘Interested.

The developed ontology and the rules are used to

predict the response of a user for a new email.

4 EXPERIMENTS, RESULTS

AND ANALYSIS

Experiments were performed to test the accuracy of

both: 1) Email Categorizer which provides a label to

emails 2) Ontology Based Classiﬁer which predicts

the response of a user towards an email based on user

characteristics, preferences and the label of test email.

In this section, ﬁrst we present the results and analysis

of the Categorizer along with the performance of the

proposed term weighing method. Finally we present

the results of the Ontology Based Classiﬁer and its

comparison with original scheme.

KEOD 2010 - International Conference on Knowledge Engineering and Ontology Development

168

Table 2: Comparison of the proposed system with Kim’s scheme.

S. No. Aspect of the Classiﬁer

Original Scheme by Kim et al.

(2007)

Proposed Scheme

1 Emails used Korean emails collected

English emails from 20news-

group

2 Email Label Manually provided Automatically generated

Data collection for Ontology

based Classiﬁer

40 emails-90 students 50 emails-30 students

4 Responses sought from users Inbox/Reply/Delete/Spam Interested/Not-Interested

5 Decision tree used ID3 C4.5

6 Ontology language Web-PDDL OWL-DL

Reasoner used for response pre-

diction

OntoEngine

Jenas Rule based Inference En-

gine

8 No. of rules generated 89 42

9 Overall Accuracy 60.1% 67.6%

4.1 Analysis of Categorization System

Fig. 5 shows the performance of several methods used

for feature space reduction using subject-based term

weighing method. The number of features for each

method has been set to 20000. As observed from the

graph, the Chi-square method outperforms all other

methods in dataset used. We have compared the

Figure 5: Performance of various feature selection methods.

performance of proposed Subject-based term weigh-

ing method with traditional term weighing methods

(TF and TF-IDF) over MI and Chi-square feature

selection methods. The proposed weighing method

considerably outperforms TF and TF-IDF weighing

methods over all the feature selection methods ex-

perimented with. Fig.6 and Fig.7 display the results

of the experiments conducted with MI feature selec-

tion and Chi-square feature selection methods respec-

tively. The adjustment value A in our proposed term

weighing method (Eq 1) was taken as 5 while com-

paring the term weighing methods. The effect of vari-

ation in value of A over the categorizer performance

was also analyzed.

Figure 6: Comparison of term weighing methods (using

Mutual Information feature selection method).

Figure 7: Comparison of term weighing methods (using

Chi-square feature selection method).

4.2 Analysis of Ontology based

Classiﬁer

Table 2 presents a comparison of our system with the

classiﬁcation system developed by (Kim et al., 2007).

We outlined the differences in the methods, imple-

mentations and results, an outright comparison in

terms of accuracy being not possible since the datasets

used are different. Kims system used Korean emails

for experimentation while our system works for En-

glish emails. The overall ontology based scheme we

have proposed is on similar lines as (Kim et al., 2007)

but with a few modiﬁcations and a few marked dif-

AUTOMATIC EMAIL CLASSIFICATION USING USER PREFERENCE ONTOLOGY

169

Table 3: Axioms derived by C 4.5 Decision Tree mining and their performance.

S. No. Rules Accuracy(%)

Email Label = ‘sports’ & sports = ‘False’ & gender = ‘Male’ & computers =

‘True’ then Interested

100

Email Label = ‘religion’ & religion = ‘False’ & medicine = ‘True’ & age =

‘Junior’ then NotInterested

83.3

Email Label = ‘politics’ & age = ‘Junior’ & religion = ‘False’ & computers =

‘True’ & gender = ‘Male’ then Interested

83.33

Email Label = ‘sports’ & sports = ‘False’ & gender = ‘Female’ then NotInter-

ested

77.78

Email Label = ‘religion’ & religion = ‘False’ & medicine = ‘True’ & age =

‘Senior’ & gender = ‘Female’ then NotInterested

66.67

ferences in implementation (as outlined in Table 2),

the major improvement being the automatic labeling

of emails by a categorizer.

As proposed by (Kim et al., 2007), Rule Accuracy

or Rule Conﬁdence can be used to calculate correct-

ness of each rule in the ontology. For our classiﬁca-

tion scheme, this score can be a metric based on con-

ﬁdence of the Rule because more is the conﬁdence of

Rule, more is the probability that the Response pre-

dicted is correct. Some of the rules along with their

accuracy are listed in Table 3. The overall accuracy

of our system was found to be 67.6%.

5 CONCLUSIONS

We have extended, and implemented a method for

email classiﬁcation originally proposed by (Kim et

al., 2007). The salient feature of the approach is that it

incorporates user preferences towards the ﬁnal classi-

ﬁcation decision. Another important feature is utiliza-

tion of a user preference ontology to classify emails.

We have removed human intervention from the task of

providing label to each email manually, by integrat-

ing the ontology based classiﬁcation system with an

automated email categorization system. The limited

amount of data collected might have been a constraint

towards getting clear correlation between user pref-

erences and their responses since the collected data

might have not captured all correlations.

This work also establishes the better performance

of the proposed term weighing method over the con-

ventional TF and TF-IDF methods when tested on

emails. Various feature selection methods were also

compared using the naive Bayesian classiﬁer among

which Chi-square method gave best results. However,

since emails behave little differently than normal doc-

uments during categorization, incorporation of more

heuristics and parameters might improve the accuracy

of the categorizer and how the effect of any heuristic

varies with change in categorization algorithms would

be an interesting thing to observe.

REFERENCES

Brewer, D., Thirumalai, S., Gomadam, K., Li, K., 2006. To-

wards an Ontology Driven Spam Filter. In Proceed-

ings of 22nd International Conference on Data Engi-

neering Workshops.

Itskevitch, J. 2001. Automatic hierarchical e-mail classi-

ﬁcation using association rules. MS Thesis, Simon

Fraser University.

Kim, J., Dou, D., Liu, H., Kwak, D., 2007. Constructing

A User Preference Ontology for Anti-spam Mail Sys-

tems. In Proceedings of the 20th Conference of the

Canadian Society For Computational Studies of intel-

ligence on Advances in Artiﬁcial intelligence. Mon-

treal, Canada.

Rennie, J. Ken Lang 2010. 20newsgroup dataset

http://people.csail.mit.edu/jrennie/20Newsgroups

Youn S., McLeod D., 2006. Ontology Development Tools

for Ontology-Based Knowledge Management. Ency-

clopedia of E-Commerce, E-government and Mobile

Commerce, Idea Group Inc.

Youn, S., Mcleod, D., 2007. Spam Email Classiﬁcation us-

ing an Adaptive Ontology. In Proceedings of 4th In-

ternational Conference on Information Technology:

New Generations (ITNG). Las Vegas, NV.

Youn S., Mcleod D., 2009. Spam Decisions on Gray E-

mail using Personalized Ontologies. In Proceedings

of the 2009 ACM symposium on Applied Computing

SAC09, Honolulu, Hawaii, U.S.A.

Zhang, L. and Yao, T., 2003. Filtering Junk Email with a

Maximum Entropy Model. In ICCPOL03, Shen yang,

China.

Zhang, L., Zhu, J., Yao, T., 2004. An Evaluation of Spam

Filtering Techniques. In ACM transactions on Asian

Language Information Processing, Vol.3 No.4

KEOD 2010 - International Conference on Knowledge Engineering and Ontology Development

170