Figure 8: Snapshots of the real-world experiment for the 1
vs 2 situation. After about 15 training episodes, the robot
found the optimal position from which to see both enemies
simultaneously. In the episode with the highest reward, the
robot started from the initial position and navigated to the
optimal position by the end of the first iteration. It then
stayed there for the rest of the episode to obtain the
maximum reward.
