2 PREVIOUS WORK
Human pose estimation in 2D images is usually treated as an object detection task, where the objects to be detected are the skeleton joints of the people appearing in the images.
Felzenszwalb et al. (2010) proposed an object detection system that uses local appearance and spatial relations to recognize generic objects in an image.
Generally, this method consists of defining a model
that represents the object. The model is constructed
by defining a root filter (for the object) and a set of
filters (for the parts of the object). These filters are
used to study the features of the image. More specifically, histogram of oriented gradients (HOG) features are analyzed within each filter to represent an object category. The HOG descriptor computes the gradients over a region of the image, assuming the object within the image can be described by its intensity transitions. This method uses a sliding window
approach, where the filters are applied to all image
positions. For the creation of the final model, a discriminative approach is used, where the model learns
from annotated data, using bounding boxes around
the object. This step is usually performed by a support vector machine (SVM). After the training phase,
the model is used to detect the objects in test images.
Detection is performed by computing the convolution
of the trained part models with the feature map of the
test image and selecting the regions of the image with
the highest convolution score. One can notice that this method, despite its discriminative basis, can be interpreted as fitting the image to a model, which involves generative concepts. For this reason, it can be considered a hybrid methodology, and it may thus not be trivial to adapt it to depth images.
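The sliding-window detection step described above can be sketched as follows; this is a minimal illustration that cross-correlates a single trained filter with a feature map and keeps the highest-scoring position. The shapes and random data are assumptions: a real DPM combines a root filter with part filters and deformation costs, and operates on HOG features rather than random values.

```python
import numpy as np

def detection_scores(feature_map, root_filter):
    """Score every window position by cross-correlating the feature map
    with a trained filter (the sliding-window scoring step)."""
    fh, fw = root_filter.shape
    H, W = feature_map.shape
    scores = np.empty((H - fh + 1, W - fw + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            scores[y, x] = np.sum(feature_map[y:y + fh, x:x + fw] * root_filter)
    return scores

# Toy data standing in for HOG features and a learned filter (assumptions).
rng = np.random.default_rng(0)
fmap = rng.random((20, 20))
filt = rng.random((5, 5))
s = detection_scores(fmap, filt)
best = np.unravel_index(np.argmax(s), s.shape)  # highest-scoring position
```

Detection then amounts to reporting the positions whose score exceeds a learned threshold, rather than only the single maximum shown here.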
The random tree walk (RTW) method presented
by Jung et al. (2015) estimates 3D joints from depth
images. This work is an evolution of an earlier method proposed by Shotton et al. (2013). The main difference is that, instead of applying a per-pixel regression over all the pixels in the image, it trains a tree to estimate the direction toward a specific joint from a random starting point, rather than the distance to it. RTW only evaluates one pixel at each iteration: when it reaches a leaf of the tree, it chooses a direction. The RTW
method will then iteratively converge to the desired
joint. This method is executed hierarchically, which
means the position resulting from a joint search will
be used as the starting point for the next joint to be
calculated.
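The iterative walk can be sketched as below. The step size, number of steps, the choice of averaging the visited positions, and the stub predictor standing in for the trained regression tree are all assumptions made for illustration.

```python
import numpy as np

def random_tree_walk(start, predict_direction, n_steps=64, step_size=2.0):
    """Iteratively walk toward a joint: at each iteration, query a direction
    predictor (standing in for the trained regression tree) and take a fixed
    step; the joint estimate is the mean of the visited positions."""
    pos = np.asarray(start, dtype=float)
    visited = []
    for _ in range(n_steps):
        direction = predict_direction(pos)  # unit vector chosen at a tree leaf
        pos = pos + step_size * direction
        visited.append(pos.copy())
    return np.mean(visited, axis=0)

# Stub predictor (an assumption for illustration): always points toward a
# fixed "true" joint at (10, 10), as a perfectly trained tree would.
def toward_joint(pos, joint=np.array([10.0, 10.0])):
    d = joint - pos
    n = np.linalg.norm(d)
    return d / n if n > 1e-9 else np.zeros_like(d)

estimate = random_tree_walk(np.array([0.0, 0.0]), toward_joint)
```

Once near the joint, the walk oscillates around it, so averaging the visited positions yields a stable estimate; in the hierarchical scheme described above, this estimate would seed the walk for the next joint.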
Regarding DL approaches, Cao et al. (2017) proposed a method that uses a VGG (Simonyan and Zisserman, 2015) network to extract features from the image, and these features are used as inputs for a CNN
with two branches. The first branch is trained and used for joint detection, and the second branch is trained on the segments between the joints, so that it is able to detect the limbs connecting them. In the first branch, a feed-forward network is used to provide confidence maps for the different parts of the body. Each confidence map encodes, at every pixel, the probability that a given joint occurs at that position, expressed as a Gaussian function. In the second branch,
the part affinity vector fields are constructed, encoding the association between the parts. The part affinity fields allow joint positions to be assembled into a body posture: one field is constructed for each limb of the body and encodes both location and orientation information. The predictions for joint
and limb detections produced by the two branches of
the network are refined over several stages through an
iterative process. The predictions of each branch are
used as the input of the next stage. This method is designed to better handle images containing more than one person: it does not need to first detect people and then locate the joints of each person, which avoids errors propagated from a person detector and reduces computation time. As major disadvantages, it requires significant training data and the analysis of the entire image.
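A minimal sketch of the per-joint confidence maps used by the first branch, assuming a single joint and an arbitrary Gaussian width: each pixel's value is a Gaussian of its distance to the annotated joint position.

```python
import numpy as np

def confidence_map(joint_xy, shape, sigma=2.0):
    """Build a confidence map for one joint: each pixel's value is a
    Gaussian of its squared distance to the joint position.
    The map shape and sigma are illustrative assumptions."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    d2 = (xs - joint_xy[0]) ** 2 + (ys - joint_xy[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Joint annotated at x=12, y=8 on a 32x32 map; the peak (value 1.0)
# falls exactly on the annotated position.
cmap = confidence_map((12, 8), (32, 32))
```

In training, one such map is generated per joint as the regression target of the first branch; at test time, joint candidates are read off as local maxima of the predicted maps.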
The method presented by Papandreou et al. (2017) consists of a two-stage approach. The first stage predicts the location and scale of bounding boxes containing people using a Faster R-CNN (Ren et al., 2017) detector. Both the region proposal and the bounding box classification components of the Faster R-CNN detector were trained using only the person category of the MS COCO (Lin et al., 2014) dataset, with all other categories being ignored. In the second stage, for each bounding box proposed in the first stage, the 17 key points of the person potentially contained in the box are estimated. For better computational efficiency, the bounding box proposals are only sent to the second stage if their score is higher than a threshold (0.3). Using a fully convolutional ResNet, the system predicts two targets: (1) disk-shaped heatmaps around the key points and (2) offset fields pointing toward the precise position within each disk. The system then aggregates these two predictions in a weighted voting process, producing highly localized activation maps.
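The aggregation step can be sketched as a Hough-style weighted vote: each pixel casts a vote at its own position plus the predicted offset, weighted by the heatmap probability. The toy heatmap, offsets, and array shapes below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def aggregate_keypoint(heatmap, offsets):
    """Weighted voting: every pixel votes for (pixel + offset), weighted by
    its heatmap probability; the result is the expected key-point position."""
    H, W = heatmap.shape
    ys, xs = np.mgrid[0:H, 0:W]
    votes_x = xs + offsets[..., 0]
    votes_y = ys + offsets[..., 1]
    w = heatmap / (heatmap.sum() + 1e-9)  # normalize votes to sum to 1
    return np.array([np.sum(w * votes_x), np.sum(w * votes_y)])

# Toy example: heatmap active on a small disk around (5, 5), with offsets
# at every pixel pointing exactly at the true key point (5.0, 5.0).
H, W = 16, 16
ys, xs = np.mgrid[0:H, 0:W]
heat = ((xs - 5) ** 2 + (ys - 5) ** 2 <= 9).astype(float)
offs = np.stack([5.0 - xs, 5.0 - ys], axis=-1)
kp = aggregate_keypoint(heat, offs)
```

Because every active pixel's vote lands on the same point, the aggregate recovers the key point at sub-pixel precision, which is the motivation for combining coarse disk heatmaps with offset fields.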
The method presented by He et al. (2017), named Mask R-CNN, is an extension of Faster R-CNN
VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications