an $n_i \times m$ matrix based on the data belonging to
each non-leaf node $i$, which takes $O(m \times n_i)$ time.
Finding the appropriate attribute on which to divide the
data then takes constant time. Dividing the $n_i$ points
according to the $v_l$ features of the selected attribute
($A_l$) requires $O(n_i \times v_l)$ time.
This process is repeated in each non-leaf node.
Thus, if $K$ is the maximum number of non-leaf
nodes, which arises in a complete tree, then the
maximum time required for constructing a CCTree
with $n$ elements is
$O(K \times (n \times m + n \times v_{\max}))$.
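The cost accounting above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the entropy-based attribute choice, the `min_size` stopping rule, the toy data, and all function and field names are assumptions made for illustration. The sketch counts the two per-node terms ($n_i \times m$ for scanning the node's data and $n_i \times v_l$ for dividing it) and compares the total with the bound $K \times (n \times m + n \times v_{\max})$.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a list of categorical values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())

def build(points, attrs, min_size, stats):
    """Recursively split `points` (a list of dicts) and accumulate cost counters."""
    n_i = len(points)
    if n_i <= min_size or not attrs:
        return {"leaf": points}
    # Scan the node's data once per remaining attribute: at most n_i * m steps.
    scores = {a: entropy([p[a] for p in points]) for a in attrs}
    best = max(attrs, key=scores.get)  # assumed selection rule: highest entropy
    if scores[best] == 0:
        return {"leaf": points}  # every attribute is constant; nothing to split
    stats["ops"] += n_i * len(attrs)
    # Divide the n_i points by the v_l values of the selected attribute.
    groups = {}
    for p in points:
        groups.setdefault(p[best], []).append(p)
    stats["ops"] += n_i * len(groups)
    stats["nonleaf"] += 1
    stats["v_max"] = max(stats["v_max"], len(groups))
    rest = [a for a in attrs if a != best]
    children = {v: build(g, rest, min_size, stats) for v, g in groups.items()}
    return {"split": best, "children": children}

# Hypothetical toy data: six messages described by three categorical attributes.
points = [
    {"lang": "en", "attach": "none", "links": "few"},
    {"lang": "en", "attach": "pdf", "links": "many"},
    {"lang": "it", "attach": "pdf", "links": "many"},
    {"lang": "it", "attach": "none", "links": "few"},
    {"lang": "fr", "attach": "pdf", "links": "few"},
    {"lang": "en", "attach": "none", "links": "many"},
]
attrs = ["lang", "attach", "links"]
stats = {"ops": 0, "nonleaf": 0, "v_max": 0}
tree = build(points, attrs, min_size=1, stats=stats)

n, m, K = len(points), len(attrs), stats["nonleaf"]
bound = K * (n * m + n * stats["v_max"])
print(stats["ops"], "counted steps; bound:", bound)
```

In this sketch the counted steps stay below the bound because every node holds at most $n$ points, scans at most $m$ attributes, and splits on at most $v_{\max}$ values.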
5 CONCLUSION AND FUTURE DIRECTIONS
Spam emails impose a non-negligible cost on users and
companies, amounting to several million dollars each
year. To fight spammers effectively, whether by catching
them or by analyzing their behavior, it is not sufficient
to stop spam messages from being delivered to the final
recipient.
Instead, characterizing the spam campaigns sent by a
specific spammer is necessary to analyze that spammer's
behavior. Such an analysis can be used to tailor a more
specific prevention strategy, which could be more
effective in tackling the issue of spam emails.
Considering a large set of spam emails as a whole makes
the definition of spam campaigns an extremely
challenging task. Thus, we argue that a clustering
algorithm is required to group this huge amount of data
based on message similarities.
In this paper we have proposed a new categorical
clustering algorithm named CCTree, which we argue is
useful for the problem of clustering spam emails. The
algorithm allows an easy analysis of data based on an
informative structure: it introduces an
easy-to-understand representation in which the criteria
used to group spam emails into clusters can be inferred
at a glance. This information can be used, for example,
by officers to track and pursue a specific subset of
spam emails that may be related to an important crime.
In this paper, we have mainly presented the theoretical
results of our approach, leaving the implementation of
the CCTree algorithm and its usage in clustering spam
emails as future work. Furthermore, we plan to extend
the presented approach to include labeling of the
various clusters: we plan to use a supervised learning
approach to assign a label to each cluster on the basis
of spammer goals. Verifying both the efficiency and the
effectiveness of the proposed approach on a large
dataset is also planned as future work. Though
preliminary, the complexity analysis in this work
suggests that the algorithm is efficient. Further
extensions of this work will add a larger set of
features to better describe the structure of spam
emails, in addition to the ones already presented in
this work.
ICISSP 2015 - 1st International Conference on Information Systems Security and Privacy