traction, CNN (Convolutional Neural Network) feature computation, and bounding-box regression. The proposed R-CNN uses the selective search algorithm (van de Sande et al., 2011) to extract 2000 region proposals from the input image. Each candidate region proposal is then fed into a CNN to produce features as output.
Note that this large number of overlapping regions takes a huge amount of time to train the network, wasting computing resources and leading to an extremely slow detection speed. Furthermore, R-CNN can generate poor candidate region proposals, since the selective search algorithm it relies on is a slow and time-consuming process that affects the performance of the network.
Hence, to solve some of R-CNN's drawbacks, the Spatial Pyramid Pooling Network (SPPNet) (He et al., 2014) was proposed by K. He et al. Unlike previous CNN models, which require a fixed-size input image, SPPNet uses a Spatial Pyramid Pooling (SPP) layer that allows a CNN to produce a fixed-length representation regardless of the input image size.
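The sketch below (a minimal PyTorch-style illustration, not the original SPPNet code) shows how pooling the same feature map at several fixed grid sizes yields a fixed-length vector for any input resolution:

import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    # feature_map: (batch, channels, H, W) with arbitrary H and W.
    pooled = []
    for n in levels:
        # Adaptive max pooling to an n x n grid, independent of H and W.
        p = F.adaptive_max_pool2d(feature_map, output_size=(n, n))
        pooled.append(p.flatten(start_dim=1))  # (batch, channels * n * n)
    # Output length depends only on channels and levels, never on H or W.
    return torch.cat(pooled, dim=1)

# e.g. 256 channels with levels (1, 2, 4): 256 * (1 + 4 + 16) = 5376 features
x = spatial_pyramid_pool(torch.randn(1, 256, 13, 17))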
In spite of its improvements over the R-CNN model, there are still some disadvantages: (1) the training stage is too slow, and (2) SPPNet fine-tunes only its fully connected layers, leaving all previous layers untouched. In 2015, Ross Girshick took these limitations into consideration and proposed Fast R-CNN (Girshick, 2015), which makes classification faster.
The input image is fed into a CNN to generate a convolutional feature map, and the region proposals are determined directly from this feature map. Fast R-CNN integrates a RoI pooling layer that reshapes each identified region proposal into a fixed size, which speeds up classification, but the method still relies on selective search, which can take around 2 seconds per image to generate bounding-box proposals. Thus, it achieves a high mAP but cannot meet real-time detection requirements.
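The RoI pooling step described above can be illustrated with torchvision's roi_pool operator (used here as a stand-in, not Fast R-CNN's original implementation), which turns arbitrarily sized proposals into fixed-size features:

import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)   # CNN output for one image
# Proposals in (batch_index, x1, y1, x2, y2) format, in image coordinates.
proposals = torch.tensor([[0, 10.0, 10.0, 200.0, 150.0],
                          [0, 30.0, 40.0, 120.0, 220.0]])
# Every proposal is pooled to the same 7 x 7 grid; spatial_scale maps
# image coordinates onto the downsampled feature map (here 1/16).
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])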
Faster R-CNN (Ren et al., 2015) replaces the selective search algorithm with a Region Proposal Network (RPN) branch that predicts the region proposals. These solutions have improved the speed of Faster R-CNN, but it still struggles to meet real-time engineering requirements. Compared with two-stage detection approaches, one-stage detection approaches often involve finding the right trade-off between accuracy and computational efficiency. SSD (Liu et al., 2015) is a common object detection algorithm that performs a single forward pass of the network to locate and identify multiple objects within the input image. It therefore achieves good speed efficiency compared with two-stage RPN-based approaches.
After continuous iterative improvement of YOLO, Joseph Redmon proposed YOLO V3 (Redmon and Farhadi, 2018), which is three times faster than SSD. For 320 × 320 images, the detection time of YOLO V3 can reach 22 ms.
Considering the variability in the size and position of objects within digitized herbarium specimen (DHS) images, YOLO V3 is the more appropriate target detection network because it offers very fast operation with good accuracy for predicting the objects within DHS images. However, YOLO V3 often struggles with small and occluded objects. To address this issue, we propose an automatic object detection method based on an improved YOLO V3 deep neural network, developed on the Darknet framework.
The proposed approach uses the last four scales of feature maps, which are rich in detailed localization information, to detect small and occluded objects in the DHS images (figure 2). Furthermore, we equip the fourth detection layer with a 4× up-sampling layer instead of a 2× one, in order to obtain a feature map with higher resolution and lower-level detail.
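A minimal PyTorch-style sketch of this idea follows; our implementation is built on Darknet, and the tensor shapes and channel counts here are illustrative assumptions. A deep feature map is up-sampled by a factor of 4 and fused with an early, high-resolution feature map for the fourth detection scale:

import torch
import torch.nn as nn

# Deep, semantically strong but coarse feature map (e.g. stride 16)...
deep = torch.randn(1, 256, 26, 26)
# ...and an early, high-resolution feature map (e.g. stride 4).
shallow = torch.randn(1, 128, 104, 104)

# 4x nearest-neighbour up-sampling instead of the usual 2x,
# so the fused map keeps fine localization detail.
upsample = nn.Upsample(scale_factor=4, mode="nearest")
fused = torch.cat([upsample(deep), shallow], dim=1)
print(fused.shape)  # torch.Size([1, 384, 104, 104])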
The improved YOLO V3 was trained on data provided by the Herbarium Haussknecht in Germany. The experimental results show very high detection accuracy while keeping the same detection time.
2 PROPOSED APPROACH
YOLO V3 is the third generation of You Only Look Once (YOLO). YOLO was originally proposed by Joseph Redmon of the University of Washington; the algorithm uses a network inspired by the GoogLeNet model to realize end-to-end object detection.
The core idea of YOLO is to divide the input image into grid cells of the same size. If the center point of an object's ground truth falls within a certain grid cell, that cell is responsible for detecting the target. Note that each grid cell generates K anchor boxes of different scales and outputs B predicted bounding boxes, each comprising the position of the bounding box (center point coordinates x, y, width w, height h) and a prediction confidence.
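Concretely, YOLO V3 decodes each raw network output (tx, ty, tw, th, to) into a box relative to its grid cell and anchor prior, as sketched below (a minimal Python illustration of the standard YOLO V3 decoding, not our Darknet code):

import math

def decode_box(tx, ty, tw, th, to, cx, cy, pw, ph):
    """Decode one raw prediction for the grid cell at (cx, cy)
    with anchor-box prior (pw, ph), following YOLO V3."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)        # center stays inside its grid cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)       # width/height rescale the anchor prior
    bh = ph * math.exp(th)
    confidence = sigmoid(to)     # objectness score in [0, 1]
    return bx, by, bw, bh, confidence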
To alleviate the defects of the previous generation of YOLO, YOLO V3 integrates a residual network and adds a batch normalization (BN) layer and a Leaky ReLU layer after each convolution layer.
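This convolution-BN-LeakyReLU unit is the basic building block of the Darknet-53 backbone; a minimal PyTorch-style sketch (again illustrative, since our implementation is in Darknet) looks as follows:

import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size, stride=1):
    # Convolution -> batch normalization -> Leaky ReLU,
    # the unit repeated throughout YOLO V3.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                  padding=kernel_size // 2, bias=False),  # BN makes bias redundant
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

class ResidualBlock(nn.Module):
    # 1x1 bottleneck followed by a 3x3 convolution, with a skip connection.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            conv_bn_leaky(channels, channels // 2, 1),
            conv_bn_leaky(channels // 2, channels, 3),
        )

    def forward(self, x):
        return x + self.body(x)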
At the same time, YOLO V3 adopts a multi-scale prediction method similar to the FPN (Feature Pyramid Networks) architecture (Lin et al., 2016) to achieve a better detection effect for large, medium, and small targets. As presented in figure 1, it uses three prediction scales (13 × 13, 26 × 26 and 52 × 52) to output feature maps of different sizes.
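These three grid sizes follow directly from the network strides of 32, 16 and 8, assuming the default 416 × 416 input resolution:

# Each detection scale divides the input by its stride:
input_size = 416
for stride in (32, 16, 8):
    cells = input_size // stride
    print(f"stride {stride}: {cells} x {cells} grid")  # 13, 26 and 52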
On the other hand, YOLO V3 borrows the idea of using dimension clusters as anchor boxes (Ren et al., 2015) for predicting the bounding boxes of the system. It uses nine cluster