A2C with WA is incapable of solving the problem. In
contrast, DAA is only slightly affected by the number of
hosts and is not affected by the number of exploitable
services. When the number of services running per
machine increases, DAA maintains a success rate of
around 70%, whereas WA fails to solve these scenarios.
Taken together, the data show that DAA outperforms
WA in large environments.
These results stem from the fact that DAA uses
two separate agents to learn about the network.
When the number of services grows from 2 to 100,
the added complexity remains within the processing
capability of the exploiting agent, so the performance
of DAA is barely affected. Likewise, the structuring
agent can handle between 5 and 1024 hosts. In
contrast, the single agent of WA has to cope with host
and service proliferation at the same time. Because
the complexity grows exponentially when both of
these values increase, the performance of WA drops
rapidly until it can no longer solve the problem.
Therefore, DAA is superior to WA, as well as to the
other algorithms, when performing PT in complex
environments.
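To make the difference concrete, the following minimal Python sketch compares the joint action space a single flat agent has to search with the two smaller spaces handled by the structuring and exploiting agents; the variable names and the counting are illustrative only and do not reproduce the exact action encoding of the DAA.

    # Indicative comparison of action-space sizes; illustrative only,
    # not the exact DAA action encoding.
    num_hosts = 1024      # largest scenario reported in the experiments
    num_services = 100    # largest number of exploitable services tested

    # A single flat agent must rank every (host, service) pair jointly.
    flat_actions = num_hosts * num_services          # 102400 choices

    # The DAA splits the decision: the structuring agent picks a host,
    # then the exploiting agent picks an attack on that host.
    daa_actions = num_hosts + num_services           # 1124 choices

    print(flat_actions, daa_actions)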
6 CONCLUSIONS
This study investigates the application of RL to
pentesting. The proposed double agent architecture
is built on two separate agents in order to improve
the performance and accuracy of RL when applied to
large network environments. The paper also reports
experiments that test the efficiency of DQN-based
algorithms and evaluate their suitability for PT
problems.
The main contribution of the paper is to increase
RL’s ability to solve the PT problem when the network
is large. By dividing the PT problem into subproblems,
namely learning the structure of the network and
learning how to choose the appropriate attack on each
individual host, the double agent architecture proves
to be remarkably efficient. This method opens a new
direction for applying RL to PT problems in the future.
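A minimal sketch of how the two learned policies could interact during one episode is given below; the interfaces (structuring_agent.select_host, exploiting_agent.select_action, env.step) are hypothetical placeholders used for illustration, not the exact implementation of the DAA.

    def run_episode(env, structuring_agent, exploiting_agent, max_steps=200):
        # One penetration-testing episode with the two-level decision scheme.
        obs = env.reset()
        for _ in range(max_steps):
            # Level 1: the structuring agent, which learns the network
            # layout, selects the next host to target.
            host = structuring_agent.select_host(obs)

            # Level 2: the exploiting agent chooses the attack (scan or
            # exploit) to run against the selected host.
            action = exploiting_agent.select_action(obs, host)

            obs, reward, done, _ = env.step((host, action))

            # Both agents learn from the same reward signal.
            structuring_agent.observe(reward, done)
            exploiting_agent.observe(reward, done)
            if done:
                break
        return obs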
Experimental results show that DAA outperforms
the other algorithms when solving PT problems on
large networks. Scenarios reach up to 1024 hosts and
100 services, and DAA’s ability to successfully attack
sensitive hosts remains above 70%. When the number
of exploitable services is less than 10, the performance
of the architecture on a network of 1024 hosts reaches
81%.
One limitation of our work is the use of a network
simulator: its high level of abstraction creates a gap
between studying the problem and applying it in
practice. The main direction for future work is
therefore to use a more realistic environment, such
as virtual machines or a real network, as the input to
the DAA. Frameworks and tools such as Metasploit
and Nessus could be integrated to evaluate the method
more accurately in practice.
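One possible shape for such an integration is the Gym-style wrapper sketched below; every class and method name is hypothetical, and the calls into Metasploit or Nessus are indicated only as comments rather than real API usage.

    class RealNetworkEnv:
        """Hypothetical wrapper exposing a live testbed to the DAA through
        the same reset/step interface as the simulator in our experiments."""

        def __init__(self, target_subnet):
            self.target_subnet = target_subnet

        def reset(self):
            # e.g., revert the VMs to a clean snapshot and run an initial
            # discovery scan (Nessus, nmap, ...) for the first observation.
            return self._observe()

        def step(self, action):
            host, exploit = action
            # e.g., launch the chosen exploit against `host` through
            # Metasploit and parse the outcome; left as a placeholder here.
            success = False
            reward = 10.0 if success else -1.0
            done = success
            return self._observe(), reward, done, {}

        def _observe(self):
            # Placeholder observation; in practice this would encode
            # discovered hosts, services and compromised machines.
            return {}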
REFERENCES
Chui, M., Manyika, J., and Miremadi, M. (2016). Where
machines could replace humans—and where they
can’t (yet). McKinsey Quarterly, 30(2):1–9.
Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag,
P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., De-
gris, T., and Coppin, B. (2015). Deep reinforcement
learning in large discrete action spaces. arXiv preprint
arXiv:1512.07679.
Ghanem, M. C. and Chen, T. M. (2020). Reinforcement
learning for efficient network penetration testing. In-
formation, 11(1):6.
Hasselt, H. V. (2010). Double q-learning. In Advances in
neural information processing systems, pages 2613–
2621.
Hoang, V. N., Hai, N. N., and Tetsutaro, U. (in press). Mul-
tiple level action embedding for penetration testing. In
Proceedings of the International Conference on Fu-
ture Networks and Distributed Systems.
Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996).
Reinforcement learning: A survey. Journal of artifi-
cial intelligence research, 4:237–285.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T.,
Harley, T., Silver, D., and Kavukcuoglu, K. (2016).
Asynchronous methods for deep reinforcement learn-
ing. In International conference on machine learning,
pages 1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M.
(2013). Playing Atari with deep reinforcement learn-
ing. arXiv preprint arXiv:1312.5602.
Phillips, C. and Swiler, L. P. (1998). A graph-based system
for network-vulnerability analysis. In Proceedings of
the 1998 workshop on New security paradigms, pages
71–79.
Sarraute, C., Buffet, O., and Hoffmann, J. (2013). Pomdps
make better hackers: Accounting for uncertainty in
penetration testing. arXiv preprint arXiv:1307.8182.
Schwartz, J. and Kurniawati, H. (2019). Autonomous pen-
etration testing using reinforcement learning. arXiv
preprint arXiv:1905.05965.
Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M.,
and Freitas, N. (2016). Dueling network architectures
for deep reinforcement learning. In International con-
ference on machine learning, pages 1995–2003.