Access Prediction for Knowledge Workers in Enterprise Data Repositories

Chetan Verma, Michael Hart, Sandeep Bhatkar, Aleatha Parker-Wood, Sujit Dey


The data that knowledge workers need to conduct their work is stored across an increasing number of repositories and grows at a significant rate every year. It is therefore unreasonable to expect knowledge workers to efficiently search for and identify what they need across a myriad of locations where upwards of hundreds of thousands of items can be created daily. This paper describes a system that observes user activity and trains models to predict which items a user will access, in order to help knowledge workers discover content. We specifically investigate network file systems and determine how well future access to newly created or modified content can be predicted. Using file metadata to construct access prediction models, we show how the performance of these models can be improved for shares whose users collaborate heavily. Experiments on eight enterprise shares reveal that models based on file metadata can achieve F-scores upwards of 99%. Furthermore, on average, collaboration-aware models can correctly predict nearly half of new file accesses by users while maintaining a precision of 75%, thus validating that the proposed system can help knowledge workers discover new or modified content.
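To make the reported metrics concrete, the following minimal sketch computes precision, recall, and the F-score for a set of predicted accesses. It is illustrative only: the abstract does not state which F-score variant is used, so the balanced F1 is assumed here, and the function name, user names, and file names are hypothetical. The toy data is chosen so that precision is 75% and recall is one half, mirroring the operating point quoted for collaboration-aware models.

```python
def precision_recall_f1(predicted, actual):
    """Return (precision, recall, F1) for sets of (user, file) access pairs.

    Assumption: F1 (the harmonic mean of precision and recall) stands in
    for the unspecified "F score" in the abstract.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: six real accesses to new files, four predictions,
# three of which are correct.
actual = {
    ("alice", "report.docx"), ("alice", "budget.xlsx"),
    ("bob", "report.docx"), ("bob", "design.pptx"),
    ("carol", "budget.xlsx"), ("carol", "notes.txt"),
}
predicted = {
    ("alice", "report.docx"), ("bob", "report.docx"),
    ("carol", "budget.xlsx"), ("carol", "design.pptx"),
}

precision, recall, f1 = precision_recall_f1(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Under the F1 assumption, a precision of 75% combined with a recall of one half corresponds to an F-score of 0.6. That figure concerns predicting accesses to new files and should not be confused with the F-scores upwards of 99% reported for metadata-based models.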



Paper Citation

in Harvard Style

Verma C., Hart M., Bhatkar S., Parker-Wood A. and Dey S. (2015). Access Prediction for Knowledge Workers in Enterprise Data Repositories. In Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-096-3, pages 150-161. DOI: 10.5220/0005374901500161

in Bibtex Style

author={Chetan Verma and Michael Hart and Sandeep Bhatkar and Aleatha Parker-Wood and Sujit Dey},
title={Access Prediction for Knowledge Workers in Enterprise Data Repositories},
booktitle={Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS},

in EndNote Style

JO - Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Access Prediction for Knowledge Workers in Enterprise Data Repositories
SN - 978-989-758-096-3
AU - Verma C.
AU - Hart M.
AU - Bhatkar S.
AU - Parker-Wood A.
AU - Dey S.
PY - 2015
SP - 150
EP - 161
DO - 10.5220/0005374901500161