Intelligent methods are a relatively new trend in
spam detection. They may eliminate disadvantages
of the traditional methods. Intelligent methods use
statistical and machine learning algorithms. These
algorithms can classify mail into several
categories using statistical or machine learning
models constructed beforehand on the basis of
precedent information (Yang, 1999).
For such a system to work properly, it is
necessary to train it on a set of e-mails that have
already been classified as spam or legitimate
messages. The result of this training is a model that
is then used to classify new mail. Currently the most
popular intelligent method for spam detection is
the Naïve Bayes method (Sahami et al., 1998). It has
been implemented and is used
successfully in several spam-detection systems
(Apache, 2004a; Farmer, 2004).
Intelligent methods have several advantages in
comparison with the traditional ones. They do not
depend on external knowledge databases and do not
need regular updates. They do not use specific
features of a particular language, so they are
language-independent. They can adjust their models
using new samples of spam without the
administrator’s assistance, and they can build
personal filtering models.
Nevertheless, despite their efficiency and
intelligence, these methods are not widely used in
spam-detection systems at the enterprise level for
several reasons. First, most intelligent methods
are not stable enough when detecting legitimate mail
and have a rather high rate of false-positive errors.
Second, intelligent methods have higher hardware
requirements because they are based on
computationally expensive algorithms.
The aim of our research is to offer a
comprehensive e-mail classification solution for
enterprise-level systems that is based on the
intelligent analysis of messages. The solution should
retain the advantages of intelligent methods, such as
personalization and a high spam detection rate with a
low number of false-positive errors. At the same time,
the system should be efficient enough to be
used on enterprise-level mail servers.
2 OUR SOLUTION
Our solution is based, on the one hand, on an
intelligent classification algorithm that achieves the
necessary quality and, on the other, on a multi-agent
architecture that provides the necessary efficiency.
To solve the classification problem we
use a statistical method based on support vector
machines (SVM) (Scholkopf & Smola, 2000;
Vapnik, 1998). This method has previously been
applied to the text categorization task (Joachims,
1998). Applying SVM to the spam detection task
requires solving two problems: selecting a proper
kernel function and finding an appropriate
representation of e-mails as feature vectors.
We have selected the following representation
for electronic messages. The feature set is defined as
the set of all words that appear in the analyzed
messages more than a predetermined number of
times. The feature set is then reduced by
eliminating a set of predefined stop-words.
Additionally, the feature set is expanded with
features defined for the file extensions of all files
attached to the analyzed messages (Yang &
Pedersen, 1997).
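The construction of the feature set described above can be sketched as follows. This is a minimal illustration, not the implementation used in the system: the threshold value, the stop-word list, the tokenizer and the "ext:" prefix for attachment features are all assumptions made for the example.

```python
from collections import Counter
import os
import re

# Hypothetical values: the threshold and stop-word list are illustrative only.
MIN_OCCURRENCES = 2
STOP_WORDS = {"the", "a", "and", "to", "of"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_feature_set(messages, min_occurrences=MIN_OCCURRENCES,
                      stop_words=STOP_WORDS):
    """Build a sorted feature list from (body, attachment_names) pairs."""
    word_counts = Counter()
    extensions = set()
    for body, attachments in messages:
        word_counts.update(tokenize(body))
        for name in attachments:
            ext = os.path.splitext(name)[1].lower()
            if ext:
                # One feature per attachment file extension (assumed naming).
                extensions.add("ext:" + ext)
    # Keep words seen more than the threshold, then drop stop-words.
    words = {w for w, n in word_counts.items()
             if n > min_occurrences and w not in stop_words}
    return sorted(words | extensions)


messages = [
    ("Buy cheap pills now", ["offer.EXE"]),
    ("cheap pills and cheap watches", []),
    ("Meeting notes for the project", ["notes.pdf"]),
]
features = build_feature_set(messages, min_occurrences=1)
print(features)  # ['cheap', 'ext:.exe', 'ext:.pdf', 'pills']
```

Counting words over the whole corpus before filtering keeps the feature set small, which matters because every message is later expressed over this set.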
Thus, each message is represented as a vector over
the feature set. Each element of the vector is the
number of occurrences of a particular feature in the
message, normalized by the number of features in the
message.
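A minimal sketch of this representation is given below. It assumes that the normalizing quantity is the total number of feature occurrences found in the message, which is one possible reading of the description; the function name and the sample data are hypothetical.

```python
from collections import Counter


def to_feature_vector(message_features, feature_set):
    """Return normalized occurrence counts of each feature in feature_set.

    message_features: the features (words, attachment markers) found in one
    message, with repetitions. Items outside feature_set are ignored.
    """
    known = set(feature_set)
    counts = Counter(f for f in message_features if f in known)
    # Assumption: normalize by the total number of known feature occurrences.
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(feature_set)
    return [counts[f] / total for f in feature_set]


feature_set = ["cheap", "ext:.exe", "pills", "watches"]
message = ["buy", "cheap", "pills", "cheap", "ext:.exe"]
vector = to_feature_vector(message, feature_set)
print(vector)  # [0.5, 0.25, 0.25, 0.0]
```

Normalizing by message size makes long and short messages comparable, so that the classifier sees relative rather than absolute frequencies.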
We have carried out several experiments with
various standard kernel functions and have
found that the RBF kernel function gives
good results. It provides a high level of accuracy and
acceptable efficiency of the algorithm.
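For reference, the standard RBF (Gaussian) kernel has the form K(x, y) = exp(-gamma * ||x - y||^2). The sketch below computes it for two feature vectors; the width parameter gamma and its default value are assumptions, since no parameter values are stated here.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum similarity of 1.0.
same = rbf_kernel([0.5, 0.5], [0.5, 0.5])
# Vectors that differ give a value strictly between 0 and 1.
different = rbf_kernel([0.5, 0.5], [0.5, 0.0], gamma=4.0)
```

The kernel value decreases monotonically with the distance between the two vectors, and gamma controls how quickly it falls off.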
In addition, the solution should meet the following
basic requirements: high efficiency; enterprise-level
operation; the ability to take into account the
personal features of each user’s correspondence;
platform independence; scalability; safety and
privacy. These requirements
lead us to a multi-agent architecture for the system.
The general architecture of the system is shown in
Figure 1.
The central communication node of the system
consists of one or several web servers. It provides
a communication environment for the training and
classifying agents, supports a shared vocabulary,
converts messages to feature sets and provides a GUI
for users. The communication node stores the shared
vocabulary, temporary feature vectors and some
additional user information in the database. All
time-consuming operations, such as preprocessing and
downloading messages, training user models and
classification, are delegated to the corresponding
agents.
The training agent is a process that analyses
a user’s messages and builds the user’s personal
model on the basis of this analysis. The training
agent can be customized for different message storages.
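One way to support different message storages is to hide the storage behind a small interface, as in the sketch below. All names here (MessageStore, InMemoryStore, TrainingAgent, spam_ratio) are hypothetical and are not taken from the system; the fit function is a trivial stand-in for real model training.

```python
from typing import Callable, Dict, Iterable, List, Protocol, Tuple


class MessageStore(Protocol):
    """Hypothetical interface: any source of a user's labelled messages."""

    def fetch_labelled(self, user: str) -> Iterable[Tuple[str, bool]]:
        ...


class InMemoryStore:
    """Toy storage used for illustration; a real one might wrap a mail server."""

    def __init__(self, data: Dict[str, List[Tuple[str, bool]]]):
        self._data = data

    def fetch_labelled(self, user: str) -> Iterable[Tuple[str, bool]]:
        return list(self._data.get(user, []))


class TrainingAgent:
    """Builds a personal model from whatever storage it is given."""

    def __init__(self, store: MessageStore,
                 fit: Callable[[List[str], List[bool]], object]):
        self._store = store
        self._fit = fit

    def train(self, user: str) -> object:
        texts: List[str] = []
        labels: List[bool] = []
        for text, is_spam in self._store.fetch_labelled(user):
            texts.append(text)
            labels.append(is_spam)
        if not texts:
            raise ValueError("no labelled messages for user " + user)
        return self._fit(texts, labels)


def spam_ratio(texts: List[str], labels: List[bool]) -> float:
    """Stand-in for model fitting: returns the fraction of spam messages."""
    return sum(labels) / len(labels)


store = InMemoryStore({"alice": [("cheap pills", True),
                                 ("meeting at noon", False)]})
agent = TrainingAgent(store, fit=spam_ratio)
model = agent.train("alice")
```

Because the agent depends only on the interface, a server-side storage and a workstation-side storage could be swapped without changing the training logic.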
In the current version it is located on the centralized
mail server and accesses personal data using the IMAP
protocol. An alternative would be a personal
agent on the user’s workstation that uses local mail
storage from the personal folders. The common
training workflow is as follows.
A user initiates the training procedure using the web-
based interface.
ENTERPRISE ANTI-SPAM SOLUTION BASED ON MACHINE LEARNING APPROACH