tively. As expected, increasing memory or computing resources reduces the
application completion time. More memory also yields a higher cache hit
ratio, whereas computing resources have little influence on it. We conclude
that, for most configuration combinations, our algorithm BLCR achieves the
lowest completion time and the highest hit ratio.
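The relationship between memory size and hit ratio can be illustrated with a small trace-driven replay. The sketch below is not BLCR and does not reproduce our simulator: it assumes a plain LRU policy as a stand-in, and the function name `lru_hit_ratio` and the access trace are hypothetical, chosen only to show how a hit ratio is measured as cache capacity grows.

```python
from collections import OrderedDict


def lru_hit_ratio(trace, capacity):
    """Replay a block-access trace against an LRU cache; return the hit ratio.

    LRU is an assumed stand-in policy for illustration only (not BLCR).
    `capacity` is the number of blocks the cache can hold.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        elif capacity > 0:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits / len(trace) if trace else 0.0


# Hypothetical block-access trace (not taken from the evaluated workloads).
trace = ["a", "b", "c", "a", "b", "d", "a", "b", "c", "d"]
ratios = {cap: lru_hit_ratio(trace, cap) for cap in (1, 2, 3, 4)}
for cap, ratio in ratios.items():
    print(f"capacity={cap} hit_ratio={ratio:.2f}")
```

On this toy trace the hit ratio never decreases as capacity grows, which matches the trend reported above for memory resources; a policy other than LRU would give different absolute values.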
6 CONCLUSIONS
In this paper, we investigate the block-level cache replacement problem for
large-scale in-memory data processing systems, taking the application's DAG
into consideration. To solve the problem, we develop the algorithm BLCR
based on dynamic programming. Finally, we conduct trace-driven simulations
to evaluate the performance of BLCR and to measure the impact of scenario
parameters. The results show its superiority over the state-of-the-art
alternatives. In future work, we will further study the block-level cache
replacement problem and strive to design a near-optimal approximation
algorithm that runs in polynomial time.
ACKNOWLEDGEMENTS
This work is supported by State Grid Jiangsu Technic Project “Research on
Cloud Native Data Processing Architecture based on Data Lake” (No.
SGJSXT00SGJS2200159).
ISAIC 2022 - International Symposium on Automation, Information and Computing