(57000 patterns). Over that dataset, we have tested several classification methods after applying data balancing techniques. Then, the best five have been evaluated in depth over several training/test divisions, using two methods: sequential patterns (consecutive URL accesses) and randomly selected patterns.
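The two ways of dividing the log into training and test sets can be sketched as follows. This is a minimal illustration, not the paper's actual experimental code; the pattern list and split ratio are assumptions.

```python
import random

def split_sequential(patterns, train_ratio=0.8):
    """Take the first part of the log as training data and the rest as test,
    preserving the order of consecutive URL accesses."""
    cut = int(len(patterns) * train_ratio)
    return patterns[:cut], patterns[cut:]

def split_random(patterns, train_ratio=0.8, seed=42):
    """Shuffle the patterns before splitting, so both sets are drawn from
    the whole log rather than from contiguous blocks of time."""
    shuffled = patterns[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

log = [f"url_{i}" for i in range(100)]   # stand-in for the URL access patterns
train_s, test_s = split_sequential(log)  # test set is the tail of the log
train_r, test_r = split_random(log)      # test set is spread over the whole log
```

The sequential split mimics deployment (train on the past, classify the future), while the random split avoids the time-of-day bias discussed below.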
The results show classification accuracies between 95% and 97%, even when using the unbalanced datasets. However, accuracy decreases slightly when data are lost by applying an undersampling (pattern removal) method, or when the training and test sets are taken sequentially from the main log file, since certain URL requests are only made at certain times.
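The undersampling step mentioned above can be sketched as follows; this is an illustrative random-undersampling implementation under assumed class labels ("normal"/"anomalous"), not the paper's exact procedure. It shows where the loss of data comes from: majority-class patterns are simply discarded.

```python
import random
from collections import Counter

def undersample(patterns, labels, seed=0):
    """Randomly drop majority-class patterns until every class is as small
    as the minority class; the removed patterns are lost to training."""
    rng = random.Random(seed)
    by_class = {}
    for p, y in zip(patterns, labels):
        by_class.setdefault(y, []).append(p)
    n_min = min(len(group) for group in by_class.values())
    out = []
    for y, group in by_class.items():
        for p in rng.sample(group, n_min):  # keep only n_min per class
            out.append((p, y))
    rng.shuffle(out)
    return out

patterns = list(range(100))
labels = ["normal"] * 90 + ["anomalous"] * 10   # 9:1 imbalance
balanced = undersample(patterns, labels)
counts = Counter(y for _, y in balanced)         # now 10 of each class
```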
We can therefore conclude that the approach has been successful and that it would be a useful tool in an enterprise.
Future lines of work include conducting a deeper set of experiments to test the generalisation power of the method, for instance considering bigger data divisions, bigger datasets (covering a whole day or working day), or adding some kind of ‘noise’ to the dataset.
Moreover, considering the good classification results obtained in this work, the next step could be applying these methods to the real system from which the data were gathered, relying on the opinion of expert CSOs to assess the real value of the proposal. The study of other classification methods could be another research branch, along with the implementation of a Genetic Programming approach, which could deal with the imbalance problem through a modification of the cost associated with misclassification (as the authors did in (Alfaro-Cid et al., 2007)).
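The cost-modification idea can be illustrated with a minimal cost-sensitive fitness function of the kind a Genetic Programming run would minimise. The cost matrix, class names, and classifier are hypothetical stand-ins, not values from (Alfaro-Cid et al., 2007).

```python
# Assumed cost matrix: missing an anomalous access is penalised ten times
# more than raising a false alarm on a normal one.
COST = {("anomalous", "normal"): 10.0,   # true class, predicted class
        ("normal", "anomalous"): 1.0}

def cost_sensitive_fitness(predict, patterns, labels):
    """Total misclassification cost of a candidate classifier; a GP run
    would minimise this instead of the plain error count, so a classifier
    that ignores the minority class scores badly despite high accuracy."""
    total = 0.0
    for x, y in zip(patterns, labels):
        y_hat = predict(x)
        if y_hat != y:
            total += COST[(y, y_hat)]
    return total

always_normal = lambda x: "normal"        # trivially 80% accurate below
xs = list(range(10))
ys = ["anomalous"] * 2 + ["normal"] * 8
fit = cost_sensitive_fitness(always_normal, xs, ys)  # two missed anomalies
```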
Finally, we also aim to extract additional information from the URL string that could be transformed into features more discriminative than the current set. Moreover, a data process that summarises session information (such as the number of requests per client or the average connection time) will also be considered.
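Both future-work ideas can be sketched briefly. The feature names and choices below are illustrative assumptions, not the feature set actually used in this work.

```python
from urllib.parse import urlparse

def url_features(url):
    """Hypothetical features derived from the URL string alone."""
    parsed = urlparse(url)
    return {
        "path_depth": parsed.path.count("/"),
        "n_params": len(parsed.query.split("&")) if parsed.query else 0,
        "url_length": len(url),
        "has_extension": "." in parsed.path.rsplit("/", 1)[-1],
    }

def session_summary(requests):
    """Per-client session statistics: number of requests and the average
    gap between request timestamps (a proxy for connection time)."""
    times = sorted(t for t, _ in requests)
    gaps = [b - a for a, b in zip(times, times[1:])]
    return {
        "n_requests": len(requests),
        "avg_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }

feats = url_features("http://example.com/a/b/page.php?id=1&x=2")
summ = session_summary([(0, "url_a"), (10, "url_b"), (20, "url_c")])
```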
ACKNOWLEDGEMENTS
This paper has been funded in part by the European project MUSES (FP7-318508), along with the Spanish National project TIN2011-28627-C04-02 (ANYSELF), project P08-TIC-03903 (EVORQ) awarded by the Andalusian Regional Government, and projects 83 (CANUBE) and GENIL PYR-2014-17, both awarded by the CEI-BioTIC UGR.
REFERENCES
Alfaro-Cid, E., Sharman, K., and Esparcia-Alcázar, A.
(2007). A genetic programming approach for
bankruptcy prediction using a highly unbalanced
database. In Giacobini, M., editor, Applications of
Evolutionary Computing, volume 4448 of Lecture
Notes in Computer Science, pages 169–178. Springer
Berlin Heidelberg.
Anderson, J. P. (1980). Computer security threat mon-
itoring and surveillance. Technical report, James P.
Anderson Co., Fort Washington, PA.
Blanco, L., Dalvi, N., and Machanavajjhala, A. (2011).
Highly efficient algorithms for structural clustering of
large websites. In WWW ’11: Proceedings of the 20th
International Conference on World Wide Web, pages
437–446. ACM.
Breiman, L. (2001). Random forests. Machine Learning,
45(1):5–32.
Chawla, N. (2005). Data mining for imbalanced datasets:
An overview. In Maimon, O. and Rokach, L., edi-
tors, Data Mining and Knowledge Discovery Hand-
book, pages 853–867. Springer US.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer,
W. P. (2002). SMOTE: Synthetic minority over-
sampling technique. J. Artif. Int. Res., 16(1):321–357.
Chen, H., Chung, W., Qin, Y., Chau, M., Xu, J. J., Wang, G.,
Zheng, R., and Atabakhsh, H. (2003). Crime data min-
ing: An overview and case studies. In Proceedings of
the 3rd National Conference for Digital Government
Research (dg.o 2003), volume 130, pages 1–5. Digital
Government Society of North America.
Clifton, C. and Marks, D. (1996). Security and privacy im-
plications of data mining. In ACM SIGMOD Work-
shop on Research Issues on Data Mining and Knowl-
edge Discovery, pages 15–19.
Danezis, G. (2009). Inferring privacy policies for so-
cial networking services. In Proceedings of the 2nd
ACM Workshop on Security and Artificial Intelligence,
AISec ’09, pages 5–10, New York, NY, USA. ACM.
de Vel, O., Anderson, A., Corney, M., and Mohay, G.
(2001). Mining e-mail content for author identifica-
tion forensics. SIGMOD Record, 30(4):55–64.
Domingos, P. and Pazzani, M. (1997). On the optimality
of the simple Bayesian classifier under zero-one loss.
Machine Learning, 29:103–137.
Elomaa, T. and Kääriäinen, M. (2001). An analysis of re-
duced error pruning. Journal of Artificial Intelligence
Research, 15:163–187.
Frank, E. and Witten, I. H. (1998). Generating accurate
rule sets without global optimization. In Shavlik, J.,
editor, Fifteenth International Conference on Machine
Learning, pages 144–151. Morgan Kaufmann.
Frank, E. and Witten, I. H. (2011). Data Mining: Practi-
cal Machine Learning Tools and Techniques. Morgan
Kaufmann Publishers, third edition.
Greenstadt, R. and Beal, J. (2008). Cognitive security for
personal devices. In Proceedings of the 1st ACM
Workshop on Workshop on AISec, AISec ’08, pages
27–30, New York, NY, USA. ACM.
Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by Means of Categorical Classifiers