The results here indicate that even an unsupervised
technique such as EM clustering could be quite
strong against the current crop of image spam. However,
if spammers use somewhat more advanced
techniques, it is highly unlikely that the resulting
image spam can be effectively detected using any
combination of the 38 image-processing-based features
we have considered in this paper.
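
To illustrate how an unsupervised technique such as EM clustering can separate two classes without labels, the following is a minimal sketch, not the implementation used in this paper. It assumes a single hypothetical scalar feature generated synthetically (the analysis here uses 38 features on real images), and it fits a two-component one-dimensional Gaussian mixture. The function names, the synthetic means and standard deviation, and the low/high initialization are all assumptions made for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_two_clusters(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM.

    Returns (means, variances, weights, responsibilities). Component 0 is
    initialized at the smallest value and component 1 at the largest.
    """
    n = len(values)
    overall_mean = sum(values) / n
    overall_var = max(sum((v - overall_mean) ** 2 for v in values) / n, 1e-6)
    means = [min(values), max(values)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for x in values:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = max(p[0] + p[1], 1e-300)  # guard against underflow
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate parameters from the soft assignments.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-300)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var_k, 1e-6)  # keep the variance away from zero
            weights[k] = nk / n
    return means, variances, weights, resp


def make_synthetic_feature(seed=0, per_class=200):
    """Hypothetical data: 'ham' values near 0.2 and 'spam' values near 0.8."""
    rng = random.Random(seed)
    values = [rng.gauss(0.2, 0.05) for _ in range(per_class)]
    values += [rng.gauss(0.8, 0.05) for _ in range(per_class)]
    labels = [0] * per_class + [1] * per_class  # used only for evaluation
    return values, labels


def cluster_accuracy(resp, labels):
    """Fraction of samples whose most likely component matches the label."""
    predicted = [0 if r[0] >= r[1] else 1 for r in resp]
    return sum(int(p == y) for p, y in zip(predicted, labels)) / len(labels)


if __name__ == "__main__":
    values, labels = make_synthetic_feature()
    means, variances, weights, resp = em_two_clusters(values)
    print("estimated means:", means)
    print("clustering accuracy:", cluster_accuracy(resp, labels))
```

The labels are never passed to `em_two_clusters`; they are used only afterwards to score the clusters. When the two classes overlap heavily in feature space, as with the improved image spam discussed here, the same procedure would be expected to separate them poorly.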
Using samples of real-world ham and spam images,
we showed that machine learning algorithms based on
features extracted by image processing techniques can
be used to construct strong classifiers. Our results on
these real-world datasets improve slightly over the
related work in (Annadatha and Stamp, 2016).
We also showed that it is not difficult to generate
much stronger image spam, in the sense that the detection
problem is significantly more challenging. In
addition, we showed that such improved image spam
cannot be reliably detected using the image-processing-based
features considered here. The challenge dataset in this paper
improves on the one presented in (Annadatha and Stamp, 2016),
in the sense that it is significantly more difficult
to distinguish from ham, even when using a richer
and more informative feature set. These results indicate
that we will likely need new approaches to detect
image spam in the future.
More research is needed to develop and analyze
improved methods for image spam detection. To this
end, we have developed a large image spam challenge
dataset that we will provide to any researchers in this
field. Experiments on this challenge dataset will make
it possible to directly compare results based
on different proposed detection techniques. Additional
experiments involving this dataset using neural
networks and deep learning would be timely, and it
would be interesting to have a direct comparison
between the SVM analysis in this paper and deep
learning techniques.
