Data Cleaning Technique for Security Big Data Ecosystem

Diana Martínez-Mosquera, Sergio Luján-Mora

2017

Abstract

The growth of information networks has given rise to an ever-multiplying number of security threats. For this reason, some information networks have incorporated a Computer Security Incident Response Team (CSIRT) responsible for monitoring all the events that occur in the network, especially those affecting data security. Thousands or even millions of events can occur every day, and handling such an amount of information requires a robust infrastructure. Many commercial solutions are available to process this kind of information; however, they are either expensive or cannot cope with such volume. Furthermore, and most importantly, security information is by nature confidential and sensitive; thus, companies should opt to process it internally. Taking as a case study a university's CSIRT responsible for 10,000 users, we propose a security Big Data ecosystem to process a high data volume and guarantee confidentiality. During implementation, one of the first challenges was the cleaning phase after data extraction, where we observed that some data could be safely ignored without affecting the quality of the results, thus reducing storage size requirements. For this cleaning phase, we propose an intuitive technique and a comparative proposal based on the Fellegi-Sunter theory.

References

  1. Alexandrov, Bergmann R., Freytag S., Hueske F., Heise A., Kao O., Leich M., Leser U., Markl V., Naumann F., Peters M., Rheinländer A., Sax M., Schelter S., Höger M., Tzoumas K., Warneke D., 2014. The Stratosphere Platform for Big Data Analytics. In VLDB Journal, vol. 23, no 6, pp. 939-964.
  2. Arputhamary B., Arockiam L., 2015. Data Integration in Big Data Environment. In Bonfring International Journal of Data Mining, vol. 5, no 1, pp. 1-5.
  3. Aye, T. T., 2011. Web log cleaning for mining of web usage patterns. In Third International Conference on Computer Research and Development, vol. 2, pp. 490-494.
  4. Bhandare M., Barua K., Nagare V., 2013. Generic Log Analyzer Using Hadoop Map Reduce Framework. In International Journal of Emerging Technology and Advanced Engineering, vol. 3, no 9, pp. 603-607.
  5. Brizan, Guy D., Tansel, Uz A., 2006. Survey of Entity Resolution and Record Linkage Methodologies. In Communications of the International Information Management Association, vol. 6, no 3, pp. 41-50.
  6. Cárdenas A., Manadhata P., Rajan S., 2015. Big Data Analytics for Security. In IEEE Security & Privacy, vol. 11, no 6, pp. 74-76.
  7. Fellegi, I. P., Sunter A. B., 1969. A Theory for Record Linkage. In Journal of the American Statistical Association, vol. 64, no 328, pp. 1183-1210.
  8. Gill R., Singh J., 2014. An Open Source ETL Tool-Medium and Small Scale Enterprise ETL (MaSSEETL). In International Journal of Computer Applications, vol. 108, no 4, pp. 15-22.
  9. Intel white paper, 2013. Extract, Transform, and Load Big Data with Apache Hadoop. Available at: https://software.intel.com/sites/default/files/article/402274/etl-big-data-with-hadoop.pdf. Intel Corporation.
  10. Khalifa S., Elshater Y., Sundaravarathan K., Bhat A., Martin P., Iman F., Rope D., McRoberts M., Statchuk C., 2016. The Six Pillars for Building Big Data Analytics Ecosystems. In ACM Computing Surveys, vol. 49, no 2, pp. 33:1-33:35.
  11. Khayyat Z., Ilyas I. F., Jindal A., Madden S., Ouzzani M., Papotti P., Yin S., 2015. Bigdansing: A System for Big Data Cleansing. In ACM SIGMOD International Conference on Management of Data, pp. 1215-1230.
  12. Krishnan S., Haas D., Franklin M., Wu E., 2016. Towards Reliable Interactive Data Cleaning: A User Survey and Recommendations. In ACM SIGMOD/PODS Conference Workshop on Human In the Loop Data Analytics, p. 9.
  13. Maletic, J. I., Marcus, A., 2009. Data cleansing: A prelude to knowledge discovery. In Data Mining and Knowledge Discovery Handbook. Ed. by Maimon, O., Rokach, L. US: Springer, pp. 19-32.
  14. Nehe M., 2016. Malware and Log file Analysis Using Hadoop and Map Reduce. In International Journal of Engineering Development and Research, vol. 4, no 2, pp. 529-533.
  15. Winkler W. E., 1988. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. In Proceedings of the Section on Survey Research Methods, American Statistical Association, vol. 667, pp. 671.
  16. Winkler W. E., 2003. Data Cleaning Methods. In Proceedings of the ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation.

Paper Citation


in Harvard Style

Martínez-Mosquera D. and Luján-Mora S. (2017). Data Cleaning Technique for Security Big Data Ecosystem. In Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS, ISBN 978-989-758-245-5, pages 380-385. DOI: 10.5220/0006360603800385


in Bibtex Style

@conference{iotbds17,
author={Diana Martínez-Mosquera and Sergio Luján-Mora},
title={Data Cleaning Technique for Security Big Data Ecosystem},
booktitle={Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS},
year={2017},
pages={380-385},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006360603800385},
isbn={978-989-758-245-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS
TI - Data Cleaning Technique for Security Big Data Ecosystem
SN - 978-989-758-245-5
AU - Martínez-Mosquera D.
AU - Luján-Mora S.
PY - 2017
SP - 380
EP - 385
DO - 10.5220/0006360603800385