mented our framework using Apache Spark. We
partitioned our data and ran our models in parallel to
achieve high scalability. Our framework provides an
efficient data processing method for large-scale
virtual machines in a cloud setting. We use ARIMA, a
statistical model for time series, to predict the
resource utilisation of our VMs. With short-term
prediction, we accurately predicted 61% of the total
28,858 VMs analysed. In terms of execution time, each
VM is analysed and predicted in three seconds on
average.
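As a minimal sketch of the partition-then-predict pattern summarised above, and not the Spark implementation itself, the following standard-library Python groups readings by VM and forecasts each series independently. All names (partition_by_vm, forecast_next, forecast_all) and the (vm_id, timestamp, cpu) record layout are hypothetical. A thread pool stands in for Spark executors, and a zero-mean ARIMA(1,1,0) fitted by least squares stands in for full ARIMA order selection, so the sketch shows the structure of the computation rather than its speed-up or accuracy.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_by_vm(readings):
    """Group (vm_id, timestamp, cpu) readings into one time-ordered series per VM."""
    series = defaultdict(list)
    for vm_id, timestamp, cpu in readings:
        series[vm_id].append((timestamp, cpu))
    # Sort each VM's points by timestamp and keep only the utilisation values.
    return {vm: [cpu for _, cpu in sorted(points)] for vm, points in series.items()}


def forecast_next(values):
    """One-step forecast from a zero-mean ARIMA(1,1,0) fitted by least squares."""
    if len(values) < 3:
        return values[-1]  # too short to fit: repeat the last observation
    # d = 1: difference the series once.
    diffs = [b - a for a, b in zip(values, values[1:])]
    # p = 1: regress each difference on the previous one (no intercept).
    denom = sum(d * d for d in diffs[:-1])
    phi = sum(p * c for p, c in zip(diffs, diffs[1:])) / denom if denom else 0.0
    # Undo the differencing to obtain the next level.
    return values[-1] + phi * diffs[-1]


def forecast_all(readings, workers=4):
    """Fit and forecast every VM independently (assumed stand-in for Spark tasks)."""
    series = partition_by_vm(readings)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        forecasts = pool.map(forecast_next, series.values())
        return dict(zip(series.keys(), forecasts))
```

Because each VM's series is fitted in isolation, the per-VM function could in principle be handed to any parallel runtime without changes to the model code.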
To the best of our knowledge, we are the first
to analyse as many as 28,858 long-running VMs from
the Azure VM traces and to accurately predict over
17K of them.
We observed that most VMs have one or several
spikes, the majority of which are not seasonal.
Future work will investigate those spikes further
to determine whether they are potential anomalies.
ACKNOWLEDGEMENTS
This project was funded by the Petroleum
Technology Development Fund (PTDF) of Nigeria
(PTDF/ED/PHD/AA/1133/17).
We thank Michael Davis and Mohsen Koohi
Esfahani for comments that greatly improved the
manuscript, and we thank all the anonymous
reviewers for their insights.
Fast Analysis and Prediction in Large Scale Virtual Machines Resource Utilisation