multi-threaded master is improved by 35% and further improved by 33%
with the addition of the wait-free queue slave. With both optimizations
combined, peak performance is 80% higher than that of the original LWFS.
The average performance improvement is 13% for the multi-threaded master
and 19% for the wait-free queue slave.
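The relationship between the per-step gains and the combined figure can be checked with a short calculation. The sketch below assumes that the 33% gain of the wait-free queue slave is measured relative to the multi-threaded master result, so the two gains multiply rather than add; the function name `stacked_gain` is hypothetical and used only for illustration. Under that assumption, 35% followed by 33% compounds to roughly the 80% peak improvement reported above. The combined average figure printed by the sketch is derived the same way and is not a number stated in the evaluation.

```python
def stacked_gain(*step_gains_percent):
    """Compound successive relative gains (given in percent) into one overall gain.

    Assumption: each gain is measured relative to the result of the previous step.
    """
    factor = 1.0
    for gain in step_gains_percent:
        factor *= 1.0 + gain / 100.0
    return (factor - 1.0) * 100.0


# Peak gains: multi-threaded master (35%), then wait-free queue slave (33%).
peak = stacked_gain(35, 33)

# Average gains: 13% and 19%; the combined value is derived, not reported.
average = stacked_gain(13, 19)

print(f"peak: {peak:.0f}%  average: {average:.0f}%")
```

If the gains were instead both measured against the original LWFS, they would not compound this way, so the agreement with the reported 80% is consistent with, but does not prove, the step-relative reading.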
[Figure 7 plot. Y-axis: Bandwidth(MBbps), 0 to 250. X-axis: Computer node #, values 1, 4, 8, 16, 24, 32. Series: GFS, LWFS, multi-threaded Master, wait-free-queue Slave.]
Figure 7: Results of 4KB Block Random Read Aggregate Bandwidth Test.
The above results show that, with the two optimization methods
stacked, the Master/Slave model of LWFS is more stable and efficient
in large block sequential write performance in high concurrency
scenarios. Its peak aggregate bandwidth shows almost no loss compared
to GFS in these scenarios, achieving nearly lossless global storage
performance output.
5 CONCLUSIONS
The Master/Slave model is a common parallel mechanism for global
storage in supercomputing, but its limitations are becoming more
prominent as network performance and the concurrency of supercomputing
applications increase. In this paper, we propose two performance
optimization methods to address this issue: multi-threaded optimization
for the master and wait-free queue optimization for the slave.
Evaluations of the two optimization methods on the Sunway E-class
Prototype Verification System show that the peak bandwidth of 1 MB
block sequential read and write is improved by 16% and 90%
respectively, and the peak bandwidth of 4 KB block random read and
write is improved by 80% and 48% respectively. The performance of
global storage is also more stable under concurrent scenarios after
the two optimization methods are stacked. The evaluation results show
that the proposed optimization mechanism effectively reduces the
excessive performance overhead of the traditional Master/Slave model
and better meets the I/O requirements of applications in high
concurrency scenarios.
Optimization of the Master/Slave Model in Supercomputing Global Storage Systems