Table 7: Average F1-scores on the test set (15%) of a
100,000-sample subset for different numbers of reactions
(rows) and consistencies (columns). Bold records were
tested further.

Reactions   Consistency
            0      0.5    0.6    0.75
>0          0.88   0.89   0.88   0.89
>1          0.9    0.89   0.93   0.94
>2          0.91   0.91   0.91   0.94
>3          0.91   0.92   0.92   0.93
>4          0.9    0.91   0.93   0.94
>5          0.92   0.92   0.93   0.94
>7          0.92   0.93   0.93   0.95
>10         0.92   0.93   0.94   0.95
>20         0.93   0.94   0.96   0.97

Table 8: Model F1-scores for chosen consistencies and
minimum reaction counts.

Min reactions   Consistency   Samples   F1-score
0               0             649815    87%
3               0.6           140522    93%
5               0.6           98551     93%
10              0.6           59828     95%
20              0.6           34933     96%
3               0.75          102837    93%
5               0.75          66220     95%
10              0.75          39690     95%
20              0.75          22245     96%

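The trade-off in Table 8 between dataset size and F1-score can be made explicit with a little arithmetic. The sketch below copies the rows of Table 8 and computes, for each setting, the fraction of the unfiltered 649,815 samples that is kept and the F1 gain in percentage points over the unfiltered baseline. The names `TABLE_8` and `retention_table` are illustrative and do not come from the paper.

```python
# Rows of Table 8: (min_reactions, consistency, samples, f1_percent).
TABLE_8 = [
    (0, 0.0, 649815, 87),
    (3, 0.6, 140522, 93),
    (5, 0.6, 98551, 93),
    (10, 0.6, 59828, 95),
    (20, 0.6, 34933, 96),
    (3, 0.75, 102837, 93),
    (5, 0.75, 66220, 95),
    (10, 0.75, 39690, 95),
    (20, 0.75, 22245, 96),
]


def retention_table(rows):
    """Return (min_reactions, consistency, kept_fraction, f1_gain) per row.

    Both derived values are relative to the first row, which is the
    unfiltered baseline (0 minimum reactions, consistency 0).
    """
    _, _, base_samples, base_f1 = rows[0]
    return [
        (m, c, n / base_samples, f1 - base_f1)
        for m, c, n, f1 in rows
    ]


if __name__ == "__main__":
    for m, c, kept, gain in retention_table(TABLE_8):
        print(f"min reactions {m:>2}, consistency {c:<4}: "
              f"{kept:6.1%} of samples kept, F1 {gain:+d} points")
```

Read this way, the strictest setting in Table 8 gains 9 F1 points while keeping only a few percent of the original samples.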
8 CONCLUSIONS AND FUTURE WORK
The proposed CNN architecture for email image spam
classification achieves state-of-the-art performance
on publicly available datasets: a 99% F1-score on the
Dredze et al. (2007) and Princeton datasets, and a 96%
F1-score on the combination of these datasets. It also
achieves up to 96% F1-score on the presented
Email.cz image spam v1 dataset.
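For reference, the F1-score quoted throughout is the harmonic mean of precision and recall. The following is a minimal sketch of the standard definition computed from confusion-matrix counts. The function name and the example counts are illustrative only and are not taken from the experiments; how the scores were averaged is not restated in this section.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1-score from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0  # Convention: F1 is 0 when nothing is correctly detected.
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up counts: 90 spam images caught, 10 false alarms, 10 missed.
    print(f"F1 = {f1_score(tp=90, fp=10, fn=10):.2f}")
```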
The Email.cz image spam v1 dataset is published as
part of this work. The dataset aims to be recent and is
based on real email traffic. For this reason, the data
had to be anonymized, which is done by publishing only
features extracted from the images by a CNN (ResNet v1).
The dataset is published via the Academic Torrents
platform, which is distributed by nature; this should
keep the data available to others in the future. We
also considered whether the anonymization is sufficient
and concluded that it may be possible to partially
reconstruct the image data. However, doing so would be
computationally very expensive, and the level of detail
needed to recognize personal information is already
lost in the feature vector (Listik, 2018).
For future work, we want to gather a dataset over a
longer time range that also contains images correctly
classified by the current anti-spam solution, which
will lead to a much bigger dataset. Our other
suggestion is to use a more complex model architecture
or a more sophisticated reaction filtering technique
to achieve higher performance.
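As a point of comparison for any more sophisticated scheme, a simple threshold filter of the kind parameterized in Tables 7 and 8 might look like the sketch below. This is an assumption-laden illustration, not the filtering code used in this work: the `Sample` fields, the function names, and in particular the definition of consistency (taken here as the share of reactions that agree with the majority label) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical fields: counts of user reactions marking the image
    # as spam or as legitimate (ham).
    spam_reactions: int
    ham_reactions: int


def consistency(sample: Sample) -> float:
    """ASSUMED definition: share of reactions agreeing with the majority."""
    total = sample.spam_reactions + sample.ham_reactions
    if total == 0:
        return 0.0
    return max(sample.spam_reactions, sample.ham_reactions) / total


def filter_samples(samples, min_reactions: int, min_consistency: float):
    """Keep samples with enough reactions and high enough agreement."""
    return [
        s for s in samples
        if s.spam_reactions + s.ham_reactions >= min_reactions
        and consistency(s) >= min_consistency
    ]


if __name__ == "__main__":
    data = [Sample(5, 0), Sample(2, 2), Sample(1, 0), Sample(0, 0)]
    print(filter_samples(data, min_reactions=3, min_consistency=0.6))
```

Under these assumptions, raising either threshold trades dataset size for label reliability, which is the pattern visible in Table 8.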
ACKNOWLEDGEMENTS
We want to thank the Seznam.cz company (owner of
Email.cz) for providing the data for the dataset
creation, computational power, and the time of the
Email.cz team.
REFERENCES
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,
M., Ghemawat, S., Goodfellow, I., Harp, A., Irving,
G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L.,
Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore,
S., Murray, D., Olah, C., Schuster, M., Shlens, J.,
Steiner, B., Sutskever, I., Talwar, K., Tucker, P.,
Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O.,
Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and
Zheng, X. (2015). TensorFlow: Large-scale machine
learning on heterogeneous systems. Software available
from tensorflow.org.
Aradhye, H. B., Myers, G. K., and Herson, J. A. (2005).
Image analysis for efficient categorization of
image-based spam e-mail. In Document Analysis and
Recognition, 2005. Proceedings. Eighth International
Conference on, pages 914–918. IEEE.
Biggio, B., Fumera, G., Pillai, I., and Roli, F. (2011). A
survey and experimental evaluation of image spam
filtering techniques. Pattern Recognition Letters,
32(10):1436–1446.
Carpinteiro, O. A., Sanches, B. C., and Moreira, E. M.
(2017). Detecting image spam with an artificial neural
model. International Journal of Computer Science
and Information Security, 15(1):296.
Chollet, F. et al. (2015). Keras.
https://github.com/fchollet/keras.
Cormack, G. V. and Lynam, T. R. (2005). TREC 2005 spam
track overview. In TREC, pages 500–274.
Dredze, M., Gevaryahu, R., and Elias-Bachrach, A. (2007).
Learning fast classifiers for image spam. In CEAS.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep
residual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Email Image Spam Classification based on ResNet Convolutional Neural Network