ual network and the dueling DQN algorithm are applied for obstacle detection and path planning, respectively (Wen et al., 2020). This algorithm helps robots recognize and keep away from static obstacles in complex environments. In (Xie et al., 2017), a dueling double deep Q-network (D3QN) obstacle avoidance algorithm was proposed, which can be trained in a virtual environment and then applied directly to complex unknown environments.
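For reference, the dueling DQN mentioned above estimates action values by splitting the network into a state-value stream and an advantage stream, which are recombined in the standard form

Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \Big( A(s,a;\theta,\alpha) - \tfrac{1}{|\mathcal{A}|} \sum_{a'} A(s,a';\theta,\alpha) \Big),

where \theta denotes the shared feature parameters and \alpha, \beta parameterize the advantage and value streams, respectively.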
However, value-function-based DRL algorithms cannot execute continuous action decisions unless the action space is discretized; therefore, the policy gradient method is adopted for DRL. In robot path planning, the policy gradient algorithms mainly include TRPO, PPO, and DDPG. Lillicrap et al. proposed the DDPG algorithm, which applies DQN-style value estimation to the DPG algorithm. It can be utilized in continuous state and action spaces and directly improves motion stability (Lillicrap et al., 2015).
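Concretely, in DDPG the critic Q(s,a|\theta^Q) is regressed toward a target computed with slowly updated target networks Q' and \mu', and the actor \mu(s|\theta^\mu) follows the deterministic policy gradient (Lillicrap et al., 2015):

y_t = r_t + \gamma Q'\big(s_{t+1}, \mu'(s_{t+1}|\theta^{\mu'}) \,\big|\, \theta^{Q'}\big),
L(\theta^Q) = \tfrac{1}{N} \sum_t \big( y_t - Q(s_t,a_t|\theta^Q) \big)^2,
\nabla_{\theta^\mu} J \approx \tfrac{1}{N} \sum_t \nabla_a Q(s,a|\theta^Q)\big|_{s=s_t,\,a=\mu(s_t)} \, \nabla_{\theta^\mu} \mu(s|\theta^\mu)\big|_{s=s_t}.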
The TRPO algorithm, however, has some drawbacks: when the policy and the environment are too large, it can easily produce large errors. To overcome the aforementioned problems, Schulman et al. proposed the proximal policy optimization (PPO) algorithm based on TRPO (Schulman et al., 2017). In robot path planning, this algorithm replaces the common policy gradient with a stochastic gradient that optimizes a surrogate objective function using sample data gathered through interaction with the environment, which gives it good robustness and data efficiency.
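The surrogate objective referred to here is, in its widely used clipped form (Schulman et al., 2017),

L^{CLIP}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta)\hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t \big) \Big], \quad r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)},

where \hat{A}_t is an advantage estimate and \epsilon is the clipping coefficient.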
According to previous research, DRL algorithms outperform traditional algorithms in solving the mobile robot path planning problem. In discrete action scenarios, value-function-based DRL algorithms can make effective action decisions. However, the stability and convergence speed of online learning remain urgent issues worth investigating, since it is difficult to solve a sophisticated multi-stage planning problem with a single algorithm. Multi-functional floor cleaning robots have gained popularity in daily life (Milinda and Madhusanka, 2017; Milinda and Madhusanka, ). Moreover, there is little research on path planning for garbage sorting and cleaning. This paper is devoted to multi-stage path planning for a cleaning robot. The main contributions of this paper can be summarized as follows:
• 1. We propose the MDDPG algorithm to speed up model convergence:
1.1 A multi-policy network with a centralized value network is adopted: the value network receives data generated by multiple policy networks at the same time, which improves the efficiency of value estimation;
1.2 The experience replay pool is divided by priority to speed up the convergence of the model (an illustrative sketch of such a split is given after this list);
1.3 A reward function with intermediate starting points is designed to improve the convergence speed and degree of the reinforcement learning algorithm;
• 2. The garbage classes are defined, and the garbage classification model is trained based on an improved YOLOv5;
• 3. We build a multi-stage garbage path planning model to improve the generalization of garbage path planning problems.
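As a rough illustration of contribution 1.2, the following sketch divides a replay pool into a high-priority and a low-priority sub-pool based on TD error and samples mostly from the former; the two-pool split, the threshold, and the 0.7 sampling ratio are illustrative assumptions rather than the exact scheme used in this paper.

import random
from collections import deque

class PriorityPartitionedReplay:
    # Replay pool divided by priority: transitions with a large TD error
    # land in the high-priority sub-pool and are sampled more often.
    def __init__(self, capacity=10000, td_threshold=1.0, high_ratio=0.7):
        self.high = deque(maxlen=capacity // 2)
        self.low = deque(maxlen=capacity // 2)
        self.td_threshold = td_threshold
        self.high_ratio = high_ratio  # fraction of each batch drawn from the high pool

    def add(self, transition, td_error):
        pool = self.high if abs(td_error) > self.td_threshold else self.low
        pool.append(transition)

    def sample(self, batch_size):
        n_high = min(int(batch_size * self.high_ratio), len(self.high))
        n_low = min(batch_size - n_high, len(self.low))
        return random.sample(list(self.high), n_high) + random.sample(list(self.low), n_low)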
The rest of this paper is organized as follows. The DDPG algorithm is reviewed in Section 2. Our proposed method is described in Section 3. The experimental results and analysis are presented in Section 4, and conclusions and future work can be found in Section 5.
2 DEEP DETERMINISTIC POLICY GRADIENT (DDPG)
Traditional reinforcement learning algorithms use tables to record value functions; once the state or action space of a problem explodes, this causes the curse of dimensionality. Deep reinforcement learning, in contrast, parameterizes the value function or policy function and makes full use of the representation ability of neural networks to fit these functions. For this reason, scholars combined deep learning with reinforcement learning to propose deep reinforcement learning. This improvement enables deep reinforcement learning to perform well in problems with high-dimensional and continuous state spaces.
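To make the contrast concrete, the minimal PyTorch sketch below replaces a value table with a small neural network that maps a continuous state vector to the Q-values of a set of actions; the layer sizes here are arbitrary illustrative choices.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Parameterized value function: maps a state vector to one Q-value per action,
    # so no table indexed by discretized states is needed.
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# A continuous 4-dimensional state is handled directly:
q = QNetwork(state_dim=4, n_actions=2)
q_values = q(torch.randn(1, 4))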
2.1 AC Network
The Actor-Critic (AC) framework utilizes neural networks to approximate the value function and the policy function at the same time; its schematic diagram is shown in Figure 1. The AC network contains two neural networks: the actor network and the critic network. The actor network is responsible for fitting the current policy function and outputs the corresponding action according to the input state, while the critic network is responsible for estimating the value function: given the input state or state-action pair, it outputs the corresponding state value or action value.
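A minimal PyTorch sketch of these two networks is given below for a continuous-action setting; the hidden sizes and the tanh-bounded action output are illustrative assumptions rather than the exact architecture used here.

import torch
import torch.nn as nn

class Actor(nn.Module):
    # Policy network: outputs an action for the input state.
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded continuous action
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    # Value network: outputs Q(s, a) for an input state-action pair.
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))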
The actor and critic neural networks respectively parameterize the policy function and the value