Most of the previous approaches also use feature extraction techniques. Features
extracted from the email text are given as input to a classifier in order to separate spam
messages from legitimate ones. Spammers, however, have adopted different solutions to
mislead this kind of filter: obscuring text, obfuscating words with symbols, and
including neutral text to confuse the classification process. These tricks have been studied
by anti-spam researchers in order to find new solutions that restore filtering effectiveness.
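To make the effect of these tricks concrete, the following is a minimal sketch of a word-count feature extractor feeding a small naive Bayes classifier. It is not taken from any of the cited works; the class name, the training messages and the smoothing choice are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into purely alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesFilter:
    """Tiny multinomial naive Bayes over word counts (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_log_odds(self, text):
        """Log-odds of spam versus ham; positive values lean towards spam."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        score = math.log(self.doc_counts["spam"] / self.doc_counts["ham"])
        for word in tokenize(text):
            if word not in vocab:
                continue  # tokens never seen in training carry no evidence
            for label, sign in (("spam", 1), ("ham", -1)):
                total = sum(self.word_counts[label].values())
                # Laplace (add-one) smoothing, an assumption of this sketch
                p = (self.word_counts[label][word] + 1) / (total + len(vocab))
                score += sign * math.log(p)
        return score


# Hypothetical toy training set
nb = NaiveBayesFilter()
nb.train("cheap viagra buy now", "spam")
nb.train("buy cheap pills now", "spam")
nb.train("meeting agenda for monday", "ham")
nb.train("please review the project report", "ham")

plain = nb.spam_log_odds("buy cheap viagra")
obfuscated = nb.spam_log_odds("b_u_y ch3ap v!agra")
padded = nb.spam_log_odds("buy cheap viagra meeting agenda monday project report")
```

In this toy setting the symbol-obfuscated message breaks into fragments that never occurred in training, so the classifier is left with no evidence, while the message padded with neutral words gets a lower spam score than the plain one.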
Among the different tricks used by spammers, an emerging kind of spam practice is
the so-called image spam. Here spammers send their messages in attached images
that are readable by humans but hidden from the filter. Even though image spam is relatively
new, various proposals have been made in the literature to address this kind of spam
as well. Most approaches use some form of embedded text detection within images. The
rationale is that spam images must contain text in order to convey their unsolicited
commercial messages.
In particular, Wu et al. [13] defined a set of visual features in order to detect
characteristics common in spam images, such as embedded text and banner features. These
features are then combined with message text features to train a one-class SVM that
should recognize legitimate (ham) emails as falling outside the spam class.
Similarly, Aradhye et al. [2] proposed features to detect embedded text and some background
types that should be consistent with spam. Once again, they use an SVM classifier to
discriminate between ham and spam images. A different approach is instead followed
in [8]. Here the authors propose to process attached images with a state-of-the-art OCR
and then to forward OCR outputs to a text-based spam filter.
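The OCR-based pipeline just described can be sketched as follows. This is only an illustration of the control flow, not the implementation of [8]: the function names are hypothetical, and `toy_ocr` and `toy_text_filter` are stand-ins for a real OCR engine and a real text-based spam filter.

```python
from typing import Callable, Sequence


def ocr_then_filter(
    body: str,
    images: Sequence[bytes],
    ocr: Callable[[bytes], str],
    text_filter: Callable[[str], bool],
) -> bool:
    """Return True if the message is flagged as spam.

    Text recovered by OCR from each attached image is appended to the
    message body, and the combined text is passed to the text filter.
    """
    recovered = [ocr(image) for image in images]
    return text_filter("\n".join([body, *recovered]))


def toy_ocr(image: bytes) -> str:
    """Stand-in OCR: pretends the image bytes are the text they depict."""
    return image.decode("ascii", errors="ignore")


def toy_text_filter(text: str) -> bool:
    """Stand-in text filter: flags any message containing a listed word."""
    blacklist = {"viagra", "lottery"}
    return any(word in blacklist for word in text.lower().split())


flagged = ocr_then_filter("hello", [b"Win the LOTTERY today"], toy_ocr, toy_text_filter)
```

The sketch also shows the weak point of this design: the text filter only sees what the OCR step manages to recover, so any text the OCR misreads never reaches it.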
None of the aforementioned approaches, however, can be used when text within
images is deliberately distorted and/or obfuscated. As noted in [3], spammers
now try to make OCR and text detection techniques ineffective without compromising
human readability, by placing text on a non-uniform background or by using techniques
like the ones exploited in CAPTCHAs¹ (programs that generate and grade tests that
humans can pass but current computer programs cannot).
In a recent paper, Dredze et al. [6] presented an approach to image spam detection
based on an algorithm for speed-sensitive feature selection. Although the focus of the
paper is mainly on a method that can efficiently process attached images, it is interesting
to note that their approach considers both features that rely on metadata and other
simple image properties (such as size and format) and features related to the visual
content of the image itself. They do not, however, try to detect text within images, nor do they
consider the fact that spammers now use tricks for obfuscating this text.
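As an illustration of what "simple image properties (such as size and format)" can look like in practice, the sketch below reads a few such properties directly from the file header, without decoding any pixel data. It is not the feature set or the selection algorithm of [6]; the function name and the chosen features are assumptions, and only the PNG and GIF header layouts are handled.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def simple_image_features(data: bytes) -> dict:
    """Extract cheap, header-only features from raw image bytes."""
    features = {"file_size": len(data), "format": "unknown"}

    if data.startswith(PNG_SIGNATURE) and len(data) >= 24 and data[12:16] == b"IHDR":
        # PNG: 8-byte signature, 4-byte chunk length, 4-byte chunk type,
        # then big-endian 32-bit width and height.
        width, height = struct.unpack(">II", data[16:24])
        features["format"] = "png"
    elif data[:6] in (b"GIF87a", b"GIF89a") and len(data) >= 10:
        # GIF: 6-byte header, then little-endian 16-bit width and height.
        width, height = struct.unpack("<HH", data[6:10])
        features["format"] = "gif"
    else:
        return features

    features["width"] = width
    features["height"] = height
    features["aspect_ratio"] = width / height if height else 0.0
    return features


# Hypothetical header-only inputs built by hand for the example
png_header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
gif_header = b"GIF89a" + struct.pack("<HH", 320, 200)

png_features = simple_image_features(png_header)
gif_features = simple_image_features(gif_header)
```

Features of this kind are cheap because they need only the first few bytes of the attachment, which is why they are a natural fit for an approach that is concerned with processing speed.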
In this paper we define a method for overcoming some problems that still exist
with state-of-the-art spam filters when addressing image spam. In particular, we try
to fuse the key ideas of some of the previously described approaches by defining two
different sets of features. The first set should characterize an image from a global point of
view, in order to detect artifacts that typically indicate the presence of spam.
The second set of features has instead been devised for detecting malicious text in images,
¹ The term CAPTCHA (Completely Automated Public Turing Test To Tell Computers and Humans
Apart) was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas Hopper and John
Langford of Carnegie Mellon University. At that time, they developed the first CAPTCHA to be
used by Yahoo – http://www.captcha.net/