2.1 Traditional methods
In traditional computer vision, object detection
algorithms usually consist of three parts: choice of
detection windows, feature selection and classifier
design. Algorithms for choosing detection windows
have developed from multi-scale sliding windows to
Selective Search and Edge Boxes, which create region
proposals more efficiently because they use multiple
features. For feature selection, there are many
classical methods, such as Local Binary Patterns
(LBP) and Histogram of Oriented Gradients (HOG).
A Support Vector Machine (SVM) or a decision tree
is usually used as the classifier in object detection
systems. In general, traditional object detection
systems combine these three parts and perform well
in some constrained scenes. In this paper, methods
based on Faster R-CNN are used as the main
procedure.
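To illustrate the window-selection step of the
traditional pipeline, the following is a minimal sketch
of multi-scale sliding-window generation. The function
name, the base window size, the scales and the stride
fraction are assumptions made for illustration; they
are not taken from any specific system cited here.

```python
from typing import Iterator, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def sliding_windows(
    img_w: int,
    img_h: int,
    base: int = 64,                      # assumed base window side
    scales: Sequence[float] = (1.0, 2.0),  # assumed scale set
    stride_frac: float = 0.5,            # assumed step = half a window
) -> Iterator[Box]:
    """Yield square candidate windows at several scales."""
    for s in scales:
        win = int(round(base * s))
        if win > img_w or win > img_h:
            continue  # window does not fit in the image at this scale
        step = max(1, int(win * stride_frac))
        for y in range(0, img_h - win + 1, step):
            for x in range(0, img_w - win + 1, step):
                yield (x, y, x + win, y + win)


windows = list(sliding_windows(256, 128))
```

In a traditional detector, each window would then be
described by a feature such as HOG or LBP and scored
by a classifier such as an SVM. The number of windows
grows quickly with image size and the number of
scales, which is one motivation for region proposal
methods.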
2.2 Evolution of Region-based CNNs
With the growth of computing power, deep
convolutional neural networks have come to dominate
many computer vision tasks. By training repeatedly on
extensive data, deep learning can extract abstract
features that better express the key information.
Region-based convolutional neural networks perform
well, so they have become the main approach in the
field of object detection.
In 2014, a region-based CNN (R-CNN) for object
detection was first proposed (Girshick, 2014).
Compared to traditional object detection algorithms,
it locates and classifies objects with high accuracy.
However, it needs many feature extractors and SVM
classifiers, and its training time is long. To mitigate
these problems, two methods, the SPP-Net (Kaiming,
2014) and the Fast R-CNN (Girshick, 2015), have been
proposed. Instead of feeding each warped proposal
region to the CNN, the SPP-Net and the Fast R-CNN
run the CNN exactly once over the entire input image.
The proposals are then mapped onto the feature maps
of the last convolutional layer, and the detection
layers output class scores and box coordinates for
each proposal. R-CNN, SPP-Net and Fast R-CNN all
rely on generic object proposals supplied as input,
which come from selective search (Uijlings, 2013).
Selective search, however, is computationally
intensive. To reduce the computational burden of
proposal generation, Ren et al. proposed Faster
R-CNN (Ren, 2015).
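The mapping of a proposal from image coordinates onto
the last convolutional feature map, as described above,
can be sketched as follows. This is a simplified
illustration and not the exact procedure of SPP-Net or
Fast R-CNN: the function name is hypothetical, and the
total stride of 16 is an assumption that corresponds to
a VGG16-style backbone.

```python
import math
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def project_box(box: Box, stride: int = 16) -> Box:
    """Map an image-space box to feature-map cells.

    `stride` is the assumed total downsampling factor between
    the input image and the last convolutional feature map.
    """
    x1, y1, x2, y2 = box
    fx1 = math.floor(x1 / stride)
    fy1 = math.floor(y1 / stride)
    # Round the far corner up and keep at least one cell per axis.
    fx2 = max(fx1 + 1, math.ceil(x2 / stride))
    fy2 = max(fy1 + 1, math.ceil(y2 / stride))
    return fx1, fy1, fx2, fy2


cells = project_box((100, 50, 260, 210))
```

Because the convolutional features are computed once
for the whole image, each proposal only requires this
coordinate mapping followed by pooling over the
selected cells, rather than a separate forward pass.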
Since it was proposed, Faster R-CNN has been used
in many object detection applications. For example,
(Huaizu Jiang, 2016) proposed a face detection
method based on Faster R-CNN and reported state-
of-the-art results on two widely used face detection
benchmarks, FDDB and the recently released IJB-A.
3 PROPOSED METHOD
Faster R-CNN has become a mainstream method in
many object detection applications. Faster R-CNN
abandons the traditional selective search algorithm
for extracting region proposals and instead
introduces the Region Proposal Network (RPN), which
is a fully convolutional network. Faster R-CNN can
be regarded as the combination of Fast R-CNN and
RPN, so a single network can both extract region
proposals and locate objects. This network runs
faster than R-CNN and Fast R-CNN and can reach 8-9
frames per second on a professional GPU such as the
NVIDIA TITAN. The structure of Faster R-CNN is
shown in Figure 1.
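As background on how an RPN enumerates candidate
regions, the sketch below generates reference boxes
(anchors) at every feature-map location. The three
scales and three aspect ratios follow the defaults
described for Faster R-CNN (Ren, 2015); the function
name and the stride of 16 are assumptions, and the
settings used in this paper's experiments are not
stated here.

```python
import math
from typing import List, Sequence, Tuple

FBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def make_anchors(
    feat_w: int,
    feat_h: int,
    stride: int = 16,                           # assumed total stride
    scales: Sequence[float] = (128, 256, 512),  # anchor side at ratio 1
    ratios: Sequence[float] = (0.5, 1.0, 2.0),  # height / width
) -> List[FBox]:
    """Return one anchor per (location, scale, ratio) combination."""
    anchors: List[FBox] = []
    for j in range(feat_h):
        for i in range(feat_w):
            # Centre of the feature-map cell in image coordinates.
            cx = (i + 0.5) * stride
            cy = (j + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the area at s * s while varying the shape.
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append(
                        (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
                    )
    return anchors


anchors = make_anchors(feat_w=4, feat_h=3)
```

The RPN then predicts, for each anchor, an objectness
score and a box refinement. Since these predictions
are computed from the same convolutional features used
for detection, proposal generation adds little extra
cost.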
This network contains the following three
important parts:
(1) Convolutional layers
At present, there are many CNN models,
including ZFNet (Zeiler, 2013), VGGNet (Simonyan,
2014) and ResNet (Kaiming, 2016). In this
experiment, VGG16 and ResNet are selected as the
convolutional layers, and the performance of the two
networks is compared. The convolutional layers
output feature maps with a size of 1024 × 51 × 28.
The different features captured by different
channels are shown in Figure 2.
Figure 1: Structure of Faster R-CNN.