worth mentioning that when applied, positive vectors
are more concentrated in a cluster. This was precisely
one of the strong results we aimed to achieve.
6 CONCLUSION
In this paper we have addressed the problem of
clustering sparse binary vectors in the field of security.
Since the vectors we have to deal with cannot be
efficiently clustered by known techniques, the aim was
to find a procedure that would systematically cluster
data in such a way that one cluster would contain
most of the positive vectors describing potentially
suspicious terrorist activity. With such a cluster, it is
then possible to perform an intelligence-driven
post-processing step that is tractable on the cluster,
whereas it is not on the initial sets.
The study presented here considers only four datasets,
a number that could of course be considered too small
to be significant. Our techniques have been transferred
to the police entity which requested this study. Their
feedback shows that our techniques remain valid and
robust when considering vectors of larger length that
are still sparse.
We hope this study will draw attention from other
researchers who would be interested in investigating
the problem with other approaches. This is the reason
why we have been authorized to share the four datasets
used in this study. Anyone interested can contact the
first author.
Solving a Hard Instance of Suspicious Behaviour Detection with Sparse Binary Vectors Clustering