structure to extract image features. In addition, many deep learning models have been proposed, such as the Restricted Boltzmann Machine (RBM), Deep Belief Networks (DBN) and Convolutional Neural Networks (CNN). Among them, the CNN, a multi-layer neural network, is the most popular and effective network structure in the field of image recognition. In this paper, a deep convolutional neural network with the Residual Neural Network (ResNet) (Ren S, He K, Girshick R, et al., 2015) and the Feature Pyramid Network (FPN) (Lin, 2017) as the network backbone, combined with RetinaNet, is used to build the model.
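For illustration only, one possible way to assemble such a model is sketched below using the torchvision detection API. This is not the implementation used in this paper: it assumes a recent torchvision version (one that accepts the weights keyword), and the number of classes and input size are placeholders.

```python
# Illustrative sketch: a RetinaNet detector with a ResNet-101 + FPN backbone.
# Assumes a recent torchvision; num_classes and the input size are placeholders.
import torch
from torchvision.models.detection import RetinaNet
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-101 backbone whose intermediate feature maps are fused by an FPN.
backbone = resnet_fpn_backbone(backbone_name="resnet101", weights=None)

# RetinaNet head on top of the ResNet-101-FPN backbone.
model = RetinaNet(backbone, num_classes=2)

model.eval()
with torch.no_grad():
    predictions = model([torch.rand(3, 512, 512)])  # one dummy RGB image
print(predictions[0].keys())  # boxes, scores, labels
```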
3.1 Residual Neural Network (ResNet)
When constructing convolutional neural networks, traditional deep network structures such as AlexNet, VGGNet and GoogLeNet suffer from problems such as vanishing and exploding gradients. To address these problems, normalized initialization and intermediate normalization layers have been proposed. However, networks trained in this way still suffer from degradation, so this paper adopts the deep residual network (ResNet) to extract feature maps.
When normalization is used to deal with the gradient problem of a deep network, the extra layers of the deeper structure would ideally become identity mappings, which would reduce the deep structure to an equivalent shallow one. However, it is difficult for the network to fit these potential identity mappings directly, that is, to learn H(x) = x, and attempting to do so also makes the network difficult to train. Therefore, the deep residual network abandons fitting the identity mapping and instead fits the residual. Its network structure is shown in Figure 6.
Figure 6. ResNet network structure.
ResNet lets the stacked layers learn a function F(x) and passes F(x) + x to the next layer, so that F(x) + x fits the desired mapping H(x): F(x) + x = H(x), i.e., F(x) = H(x) - x. To fit an identity mapping, the network therefore only needs to drive F(x) to zero. In this paper, the 101-layer ResNet (ResNet-101) is selected to extract image features.
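To make the residual formulation concrete, the following is a minimal sketch of a bottleneck residual block of the kind stacked in ResNet-101. The channel sizes and the PyTorch implementation are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Minimal bottleneck residual block: the stacked layers learn F(x)
    and the block outputs F(x) + x, so fitting an identity mapping only
    requires driving F(x) towards zero."""

    def __init__(self, channels: int, bottleneck_channels: int):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, bottleneck_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, bottleneck_channels, kernel_size=3,
                      padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.residual(x) + x)  # F(x) + x

block = Bottleneck(channels=256, bottleneck_channels=64)
y = block(torch.rand(1, 256, 56, 56))  # shape is preserved: (1, 256, 56, 56)
```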
3.2 Feature Pyramid Network (FPN)
After image features are extracted, general models directly use the last layer of feature maps of the network, because its semantic information is strong; however, its localization accuracy and resolution are relatively low, while position information is better preserved in the earlier feature maps. Therefore, this paper adopts the feature pyramid network (FPN), which makes use of feature maps at different levels, to process the features obtained by ResNet.
FPN combines the feature map output at each level of the network with the feature map of the adjacent lower level, and outputs fused feature maps from several different levels for prediction. In this way, feature maps at different levels are used to better exploit both location information and semantic information. The structure is shown in Figure 7. C1, C2, C3, C4 and C5 are feature maps at different levels output by ResNet-101, and P2 to P6 are the feature maps produced by FPN for later prediction.
Figure 7. FPN structure.
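As an illustrative sketch of this top-down fusion (not the authors' implementation), the code below projects C2 to C5 to a common channel width, upsamples each coarser level and adds it to the next finer one, and outputs P2 to P6. Following the common convention, C1 is not fused, and the input channel widths are those of the standard ResNet-101 stages; these choices are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal top-down FPN: lateral 1x1 convs project C2..C5 to a common
    width, coarser levels are upsampled and added to finer ones, and 3x3
    convs produce P2..P5; P6 is a stride-2 downsampling of P5."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        self.output = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        laterals = [l(c) for l, c in zip(self.lateral, (c2, c3, c4, c5))]
        # Top-down pathway: upsample the coarser map and add it to the finer one.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        p2, p3, p4, p5 = [conv(x) for conv, x in zip(self.output, laterals)]
        p6 = F.max_pool2d(p5, kernel_size=1, stride=2)
        return p2, p3, p4, p5, p6
```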
3.3 RetinaNet
When foreground and background are distinguished during target detection, a large number of candidate boxes are generated based on randomly selected pixel points in the picture. The candidate boxes are then classified to determine whether these candidate