Real-time Human Pose Estimation with Convolutional Neural Networks
Marko Linna¹, Juho Kannala² and Esa Rahtu³
¹University of Oulu, Finland
²Aalto University, Finland
³Tampere University of Technology, Finland
Keywords:
Human Pose Estimation, Person Detection, Convolutional Neural Networks.
Abstract:
In this paper, we present a method for real-time multi-person human pose estimation from video by utilizing convolutional neural networks. Our method is aimed at use case specific applications, where good accuracy is essential and variation of the background and poses is limited. This enables us to use a generic network architecture, which is both accurate and fast. We divide the problem into two phases: (1) pre-training and (2) finetuning. In pre-training, the network is trained with highly diverse input data from publicly available datasets, while in finetuning we train with application specific data, which we record with Kinect. Our method differs from most of the state-of-the-art methods in that we consider the whole system, including the person detector, the pose estimator and an automatic way to record application specific training material for finetuning. Our method is considerably faster than many of the state-of-the-art methods. It can be thought of as a replacement for Kinect in restricted environments, and it can be used for tasks such as gesture control, games, person tracking, action recognition and action tracking. We achieved an accuracy of 96.8% (PCK@0.2) with application specific data.
1 INTRODUCTION
Human pose estimation in unconstrained environments is a problem where humans still perform better than computers. In recent years, the research has moved from traditional methods (Felzenszwalb et al., 2008; Andriluka et al., 2009; Yang and Ramanan, 2011; Sapp and Taskar, 2013) towards convolutional neural networks (ConvNets) (Jain et al., 2013; Toshev and Szegedy, 2014; Pfister et al., 2014; Jain et al., 2014; Carreira et al., 2015; Pishchulin et al., 2015; Pfister et al., 2015; Tompson et al., 2015; Lifshitz et al., 2016; Wei et al., 2016; Newell et al., 2016; Charles et al., 2016). Due to this, significant improvements in accuracy have been accomplished. ConvNets became popular when AlexNet (Krizhevsky et al., 2012), which could classify images into different categories, was introduced. Since then, several more efficient network architectures have been proposed, both for classification and for human pose estimation.
Many state-of-the-art ConvNet human pose estimation methods use more complex network architectures and perform considerably well in unconstrained environments (Lifshitz et al., 2016), (Newell et al., 2016), (Insafutdinov et al., 2016), where large variations in pose, clothing, view angle and background exist. While these methods have high accuracy, they are usually slow for real-time pose estimation. Recent research (Toshev and Szegedy, 2014), (Pfister et al., 2014) shows that with a generic ConvNet architecture, competitive accuracy can be achieved while still maintaining a fast forward pass time. This is the main motivation of our research. With our method, we do not aim for overall human pose estimation on diverse input data, but rather target specific use cases where high accuracy and speed are required. In such cases, the problem is different, because the environment is usually constrained, persons are in close proximity to the camera and poses are restricted. Possible applications for our method are, for instance, gesture control systems and games.
Our method is a multi-person human pose estimation system, targeted at use case specific applications. In order to support multiple people, we use a person detector, which gives the locations and scales of the persons in the target image. This brings our method towards the practice, since person locations and scales are not expected to be known, which is the case with many state-of-the-art methods (Lifshitz et al., 2016), (Wei et al., 2016), (Newell et al., 2016). We use a generic ConvNet architecture with eight layers. The key idea of our method is to pre-train the network with highly diverse input data and then finetune
it with use case specific data. We show that competitive accuracy can be achieved in application specific pose estimation, while operating in real time. Our method can be used for higher level tasks, for example, gesture control, gaming, action recognition and action tracking.
The main contributions of our method are: (1) utilization of a person detector to crop person centered images in both training and testing, thus enabling multi-person pose estimation in real world images; (2) the ability to learn from heterogeneous training data, where the set of joints is not the same in all the training samples, thus enabling the use of more varied datasets in training; (3) utilization of Kinect for automatic training data generation, thus making it easy to generate large amounts of annotated training data; (4) a somewhat slower and less accurate, depth-sensor-free alternative to Kinect (Shotton et al., 2013) in restricted environments. The frame rate of our method is about 13 Hz, whereas Kinect runs at 15 or 30 Hz, depending on the lighting conditions. Our method works with RGB cameras, while Kinect also needs a depth sensor.
2 RELATED WORK
Jain et al. (Jain et al., 2013) demonstrated that ConvNet based human pose estimation can match the performance of, and in many cases outperform, traditional methods, particularly deformable part models (Felzenszwalb et al., 2008) and multimodal decomposable models (Sapp and Taskar, 2013). Their network architecture consisted of three convolutional layers, followed by three fully connected layers. They trained the network for each body part (e.g. wrist, shoulder, head) separately. Each network was applied in a sliding-window fashion to overlapping regions of the input image. A window of pixels was mapped to a single binary output: the presence or absence of that body part. This made it possible to use a much smaller network, at the expense of having to maintain a separate set of parameters for each body part.
Another application to human pose estimation was presented by Toshev and Szegedy (Toshev and Szegedy, 2014). Their network architecture was similar to AlexNet (Krizhevsky et al., 2012), but the last layer was replaced by a regression layer, which output joint coordinates. In addition to this, they trained a cascade of pose regression networks. The cascade started off by estimating an initial pose. Then, at subsequent stages, additional regression networks were trained to predict a transition of the joint locations from the previous stage to the true locations. Thus, each subsequent stage refined the currently predicted pose. A similar idea is applied in more recent work by Carreira et al. (Carreira et al., 2015).
A video based human pose estimation method was introduced by Pfister et al. (Pfister et al., 2014). Their method utilized the temporal information available in constrained gesture videos. This was achieved by training the network with multiple frames, inserted into separate color channels of the input. The network architecture was similar to AlexNet, having five convolutional layers followed by three fully connected layers, of which the last was a regression layer. However, there were some differences compared to the previous architectures. Some of the convolutional layers were much deeper, and pooling was non-overlapping, whereas in most of the previous architectures it was overlapping. The network produced significantly better pose predictions on constrained gesture videos than the previous work. For this reason, we base our method on this network architecture.
3 METHOD
Our method is targeted at video inputs. The rough steps for a single video frame in testing are: (1) detect persons, (2) crop person centered images, (3) feed the person images forward to the pose estimation network. We use an object detector (Ren et al., 2015) to solve person bounding boxes from the input frame. The pose estimation is done for each person individually. As a result of the pose estimation, our network outputs the locations of body keypoints.

We pre-train our network using data from multiple publicly available datasets, thus obtaining good initialization values for finetuning. We evaluate pre-training and finetuning separately. For the evaluation of the finetuning, we use data recorded with Kinect. As for the ConvNet framework, we use Caffe (Jia et al., 2014) with small modifications.
3.1 Person Detection
Our method utilizes Faster R-CNN (F-RCNN) (Ren et al., 2015) to detect persons in training and testing images. The forward pass time of the F-RCNN is 60 ms or 200 ms, depending on the used network. We use the slower and more accurate model.
Table 1: Overview of used datasets in pre-training. Only the training set of the MPII Human Pose is used, because the annotations are not available for the test set. In the BBC Pose, the training set is annotated semi-automatically (Buehler et al., 2011), while the test set is manually annotated. We use only manually annotated data from the BBC Pose. We use data augmentation to expand the number of training images.

Dataset                                            Annotated   Person boxes from the dataset    Boxes for pre-training and validation
                                                   points      Train    Test    Total           Train (aug.)    Validation
MPII Human Pose (Andriluka et al., 2014)           1-16        28821    0       28821           71018           1160
Fashion Pose (Dantone et al., 2013)                13          6530     765     7295            14538           694
Leeds Sports Pose (Johnson and Everingham, 2010)   14          1000     1000    2000            5074            146
FLIC (Sapp and Taskar, 2013)                       11          3987     1016    5003            14780           0
BBC Pose (Charles et al., 2013)                    7           0        2000    2000            6764            0
Total                                                          40338    4781    45119           112174          2000

We noticed that the F-RCNN sometimes gives false positives. This is not a problem in training, since we use both the ground truth and the F-RCNN together to crop the training image. In testing, however, the pose estimation is also performed for false positives. Most likely these false positives could be filtered, especially with use case specific images, by adjusting the parameters of the F-RCNN. In the evaluation, we also use the ground truth to decide whether a frame contains a person, so it is guaranteed that all the evaluation frames contain a person. Apart from this, we ran the F-RCNN on the original finetuning evaluation frames, where the ground truth was not yet used for frame selection. This resulted in a false positive rate of 2.86% and a false negative rate of 0.65%. In all of the original evaluation frames, there is one fully visible person making gestures in a constrained environment. A person detection was considered false if the resulting bounding box did not contain a person, or if it contained a partially visible person at the edges of the bounding box; in other words, if the intersection-over-union (IoU) ratio between the detection and the ground truth was 0.5 or less.
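As an illustration, the IoU criterion can be computed as in the following minimal Python sketch; the (x0, y0, x1, y1) box format and the function name are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct only if iou(detection, ground_truth) > 0.5.
```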
3.2 Data Augmentation
The F-RCNN person detector is applied to each training image. For each detected person, the IoU between the detected person bounding box and the expanded ground truth bounding box is calculated. The expanded ground truth person box is the tightest bounding box including all the joints, expanded by a factor of 1.2. The person box having the biggest IoU is selected as the best choice. Based on the best IoU, the training image is augmented by using either of the person bounding boxes, or both (see Table 2).

Table 2: The relation between the person box overlapping ratio and the data augmentation.

Overlapping ratio     Person box type used in augmentation
IoU > 0.7             F-RCNN
IoU < 0.5             Ground truth
0.5 ≤ IoU ≤ 0.7       F-RCNN and ground truth
In practice, this means that if the detected person box is near the expanded ground truth person box, only the former is used to crop the person image, and if the detected person box is far from the expanded ground truth person box, only the latter is used. In between these, both person boxes are used to crop the person, resulting in two training images with small differences in translation and scale. The shortest side of a person box is expanded to equal the longest side, resulting in a square crop area, which defines the person image used in training. Zero padding is added where needed. A single cropped person image is rescaled to size 224×224 before feeding it to the network.
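As an illustration of the cropping described above, the following NumPy/OpenCV sketch expands a bounding box to a square, zero-pads where the square extends outside the image, and rescales the crop to 224×224; the function and variable names are assumptions made for illustration.

```python
import numpy as np
import cv2

def crop_person(img, box, out_size=224):
    """img: (H, W, 3) uint8 image; box: (x0, y0, x1, y1) person bounding box."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    side = int(round(max(x1 - x0, y1 - y0)))       # expand the shortest side to the longest
    sx0, sy0 = int(round(cx)) - side // 2, int(round(cy)) - side // 2
    sx1, sy1 = sx0 + side, sy0 + side              # square region, may exceed the image
    crop = np.zeros((side, side, 3), dtype=img.dtype)   # zero padding
    ix0, iy0 = max(sx0, 0), max(sy0, 0)
    ix1, iy1 = min(sx1, img.shape[1]), min(sy1, img.shape[0])
    crop[iy0 - sy0:iy1 - sy0, ix0 - sx0:ix1 - sx0] = img[iy0:iy1, ix0:ix1]
    return cv2.resize(crop, (out_size, out_size))
```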
In addition to the aforementioned, a training image is augmented by a horizontal flip. All in all, a single person image from a source dataset can result in either two or four augmented person centered training images.
3.3 Pre-training
We pre-train the model from scratch by using several publicly available datasets (see Table 1). The number of annotated joints varies between the datasets. The MPII Human Pose (Andriluka et al., 2014), Fashion Pose (Dantone et al., 2013) and Leeds Sports Pose (Johnson and Everingham, 2010) have full body annotations, while the FLIC (Sapp and Taskar, 2013) and BBC Pose (Charles et al., 2013) have only the upper body annotated. Since we use a single point for the head, and because the MPII Human Pose and Leeds Sports Pose have annotations for the neck and head top, we take the center point of these and use it as the head point.
As we aim to study whether additional partially annotated training data brings improvement over using only fully annotated samples, our validation samples should be fully annotated. Thus, we put all the fully annotated (13 joints) person images into a single pool and sample 2000 images randomly for validation. The validation images are then removed from the pool. Next, we put all the partially annotated images into the same pool, so that it eventually contains person images with heterogeneous sets of annotated joints. Then we use the pool in training. The purpose of the pre-trained model is to offer good weight initialization values for finetuning. Pre-training takes 23 hours on three NVIDIA Tesla K80 GPUs. Fig. 1 contains example pose estimations with the pre-trained network.

Figure 1: Example pose estimations with the pre-trained network. Samples are taken randomly from the testing set of the MPII Human Pose dataset. The green bounding boxes are the results of person detection and the number in the top-left corner is the probability of a box containing a person.
3.4 Finetuning
The purpose of the finetuning is to adapt the pre-trained model to a particular use case, for instance a gesture control system or a game. The pre-trained model alone is not a good enough pose estimator for our use cases, because the shallow network we use lacks the capacity to perform well with highly diverse training data. More complicated network architectures, such as (Newell et al., 2016), (Lifshitz et al., 2016), would certainly give better results, but then the speed gain achieved with the shallow network architecture would most likely be lost.
In finetuning, the pre-trained model is used for weight initialization. When the network is finetuned with use case specific data, for example to estimate poses in a gesture control system, the training data is most likely consistent. This is beneficial for accuracy: even a shallow network can produce very good estimations if the training data is limited to a particular use case. Using more complicated, and potentially slower, network architectures in these situations is therefore not necessary. We use Kinect in our experiments to produce annotations for the finetuning data, but alternative methods can be considered as well. Fig. 2 contains example pose estimations with the finetuning evaluation data.

Figure 2: Example pose estimations with the finetuned network. Predictions are in red and Kinect ground truth in green. The columns show five different frames from the evaluation data. The first row shows results of the full finetuning (experiment 3) and the second row shows results of phase 1 (experiment 1). Experiments are explained later in Section 4. Full videos are available at https://youtu.be/qjD9NBEHapY and https://youtu.be/e-P5SYL-Aqw.
3.5 Network Architecture
Our method utilizes a generic ConvNet architecture, having five convolutional layers followed by three fully connected layers, of which the last is a regression layer (see Fig. 3). The regression layer produces (x, y) position estimates for human body joints: one estimate for the head, six for the arms and six for the legs, a total of 13 position estimates. The network input size is 224 × 224 × 3. We use a generic ConvNet architecture because it has been shown to perform well in human pose regression tasks (Toshev and Szegedy, 2014), (Pfister et al., 2014). The forward pass time of the network is 16 ms on an Nvidia GTX Titan GPU, which makes it highly capable for real-time tasks.
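To make the layer configuration concrete, the following is a rough PyTorch sketch of a network with the kernel sizes, strides and layer widths of Fig. 3. Our model is implemented in Caffe; the padding, activation and normalization details below, as well as the use of LazyLinear to infer the flattened feature size, are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# Illustrative approximation of the Fig. 3 architecture (not the exact Caffe model).
pose_net = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2), nn.ReLU(inplace=True),
    nn.LocalResponseNorm(5),
    nn.MaxPool2d(kernel_size=3, stride=3),
    nn.Conv2d(96, 256, kernel_size=5, stride=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(256, 512, kernel_size=3, stride=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, stride=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, stride=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=3),
    nn.Flatten(),
    nn.LazyLinear(4096), nn.ReLU(inplace=True),     # FC6
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),   # FC7
    nn.Linear(4096, 26),                            # FC8: 13 joints x (x, y)
)

out = pose_net(torch.zeros(1, 3, 224, 224))          # -> shape (1, 26)
```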
3.6 Training Details
Figure 3: The architecture of the pose estimation network. Letters k and s denote kernel size and stride. Input 224×224×3; Conv1 (k=7, s=2) → 109×109×96, local response normalization, max pooling (k=3, s=3) → 33×33×96; Conv2 (k=5, s=1) → 33×33×256, max pooling (k=2, s=2) → 17×17×256; Conv3, Conv4, Conv5 (k=3, s=1) → 17×17×512, max pooling (k=3, s=3) → 6×6×512; FC6 (4096); FC7 (4096); FC8 (26).

In model optimization, the network weights are updated using batched stochastic gradient descent (SGD) with momentum set to 0.9. In pre-training, where the network is trained from scratch, the learning rate is set to 10⁻², the weights are initialized randomly using the Xavier algorithm (Glorot and Bengio, 2010) and the biases are set to zero. In finetuning, the learning rate is set to 10⁻³.
The loss function we use in optimization penalizes the distance between predictions and ground truth. We use a weighted Euclidean (L2) loss

E = \frac{1}{2N} \sum_{i=1}^{N} w_i \left\| x_i^{gt} - x_i^{pred} \right\|_2^2    (1)

where the vectors w, x^{gt} and x^{pred} hold the joint weights and coordinates in the form (x_1, y_1, x_2, y_2, ..., x_13, y_13). The weight w_i is set to zero if the ground truth of the joint coordinate x_i^{gt} is not available; otherwise it is set to one. This way only the annotated joints contribute to the loss. This enables training the network with datasets having only upper body annotations, along with datasets having full body annotations. The ability to utilize heterogeneous training data, where the set of joints is not the same in all training samples, potentially leads to better performance as more training data can be used.
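A minimal NumPy sketch of the weighted loss in Eq. (1) is given below; the array shapes and the exact normalization are assumptions made for illustration, not the Caffe loss layer we use.

```python
import numpy as np

def weighted_l2_loss(pred, gt, weights):
    """pred, gt, weights: (N, 26) arrays holding 13 (x, y) joint coordinates per sample.
    weights is 1 where the joint annotation exists and 0 otherwise."""
    n = pred.shape[0]
    sq_err = weights * (gt - pred) ** 2   # unannotated joints contribute nothing
    return sq_err.sum() / (2.0 * n)
```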
For comparison, we also train the pre-trained model without the weighted Euclidean loss. In this case, we use only images with fully annotated joint positions (13 joints), so that the training data is homogeneous with regard to joint annotations. Doing this reduces the size of the training data from 112174 to 66598 images. The average joint prediction errors with heterogeneous and homogeneous data are 15.7 and 16.6 pixels on 224 × 224 images. With heterogeneous data, there is about a 5% improvement in prediction error.
In batched SGD, we use a batch size of 256. Each iteration selects images for the batch randomly from the full training set. A training image contains a roughly centered person whose joints are annotated. The training images are resized to 224 × 224 before feeding them to the network. A mean pixel value of 127 is subtracted from every pixel component and the pixel components are normalized to the range [-1, 1]. Joint annotations are normalized to the range [0, 1], according to the cropped person centered image.
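The input normalization above can be sketched as follows; the exact scaling constant is an assumption, chosen only to map 8-bit pixel values approximately to [-1, 1].

```python
import numpy as np

def normalize_inputs(img, joints, crop_size=224):
    """img: (224, 224, 3) uint8 crop; joints: (13, 2) pixel coordinates in the crop."""
    x = (img.astype(np.float32) - 127.0) / 128.0   # pixels approximately to [-1, 1]
    y = joints.astype(np.float32) / crop_size      # joint targets to [0, 1]
    return x, y
```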
3.7 Testing Details
The person detector is applied to an image from which poses are to be estimated. Person images are cropped based on the detections as described earlier. In addition, for each person image, a horizontally flipped double is created. Both the original and the doubled person images are fed to the network. The final joint prediction vector is the average of the estimations of these two (the predictions of the doubled image are flipped back so that they correspond to the predictions of the original image). By doing this, a small gain in accuracy is achieved.
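The flip averaging can be sketched as below, assuming predictions in normalized [0, 1] crop coordinates; the joint ordering behind LEFT_RIGHT (head first, then right/left arm, then right/left leg) is an assumption made for illustration.

```python
import numpy as np

# Maps each joint index to its left/right mirrored counterpart (head maps to itself).
LEFT_RIGHT = np.array([0, 4, 5, 6, 1, 2, 3, 10, 11, 12, 7, 8, 9])

def flip_average(pred_orig, pred_flipped):
    """pred_orig, pred_flipped: (13, 2) predictions for the original and flipped crops."""
    unflipped = pred_flipped.copy()
    unflipped[:, 0] = 1.0 - unflipped[:, 0]   # mirror x back to the original image
    unflipped = unflipped[LEFT_RIGHT]         # swap left and right joints
    return (pred_orig + unflipped) / 2.0
```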
4 EVALUATION
We evaluate pre-training and finetuning with the percentage of correct keypoints (PCK) metric (Sapp and Taskar, 2013), where a joint location estimate is considered correct if its L2 distance to the ground truth is at most 20% of the torso length. The torso length is the L2 distance between the right shoulder and the left hip.
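The PCK@0.2 measure can be computed as in the following sketch; the joint indices for the right shoulder and left hip are assumptions made for illustration.

```python
import numpy as np

R_SHOULDER, L_HIP = 1, 10   # assumed joint indices

def pck(pred, gt, alpha=0.2):
    """pred, gt: (N, 13, 2) joint coordinates for N frames; returns the PCK@alpha score."""
    torso = np.linalg.norm(gt[:, R_SHOULDER] - gt[:, L_HIP], axis=-1)   # (N,)
    err = np.linalg.norm(pred - gt, axis=-1)                            # (N, 13)
    return float(np.mean(err <= alpha * torso[:, None]))
```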
We use 2000 randomly taken samples for the evaluation of the pre-training. For finetuning, we record data with Kinect for Windows v2 (see Table 3). We use the joint estimates produced by Kinect as ground truth. We made sure that the data was recorded in such a way that the error in the joint estimations is minimal. In practice this means good lighting conditions, no extremely rapid movements and no major body part occlusions. The gestures performed in the data try to mimic different gesture control events, where the hands are used for tasks like object selection, moving, rotating and zooming, in addition to hand drawing and wheel steering.

For the evaluation of the finetuning, we record an additional 4000 frames with identical clothing. We do three finetuning experiments, using a different set of training frames in each case (see Table 4). Experiments 1 and 2 together use the same training frames
as experiment 3. Basically, experiment 3 is the same as experiments 1 and 2 performed consecutively. The purpose of this division is to see the effect of using the same or different clothing between the training and testing data. Experiment 1 expresses more the ability of generalization (for all people), while experiments 2 and 3 express specificity (for certain people).
The results are displayed in Fig. 4 and Table 5. In full finetuning (experiment 3), with the use case specific data, an accuracy of 96.8% is achieved. In finetuning phase 1 (experiment 1), where no same clothing occurs between the training and testing data, the accuracy is 90.6%. However, if we look at the accuracy of the wrist (pre-train: 24.5%, phase 1: 67.4%, full: 89.2%), which is the most challenging body joint to estimate, but perhaps also the most important one considering a gesture control system, we can see that additional case specific training data can significantly improve the accuracy and make the system usable in practice. This originates partially from the finetuning data, where the wrist location variation is biggest. We believe that if more training data were used, and perhaps better data augmentation, a better wrist accuracy could be achieved with the current network architecture. After all, the wrist accuracy is still decent, making our method useful for many use cases.
The results indicate that a trade-off between generalization and specificity exists between pre-training and finetuning. This can be seen by comparing accuracies between the pre-trained and finetuned networks, first with the pre-train validation samples and then with the finetuning validation samples. The pre-train validation samples express the case of generalization, as they contain a large variation of persons and poses in unconstrained environments. On the contrary,
the finetuning validation samples reflect the case of specificity, as they have restricted poses in a constrained environment.

Table 3: Kinect recorded finetuning data for training. All the frames have a similar background, person and gestures, but the clothing differs. For the evaluation, we additionally record 4000 frames, which have identical clothing (clothing number 1).

Clothing   Frames
1          27222
2          18760
3          20244
4          20726
5          10560
6          11666
7          10136
Total      119314

After the full finetuning, the accuracy
on the pre-train validation set drops from 63.1% to 44.2% (light red and dark red curves in Fig. 4), while at the same time, the use case specific accuracy increases from 69.6% to 96.8% (blue and magenta curves). In certain cases, the loss in generalization is acceptable if, at the same time, a gain in specificity is achieved. One example of such a case is a gesture control system set up in a factory, where all the persons wear identical clothing. Most importantly, while generic person detection in highly varying poses and contexts is an important and challenging problem, our results show that in some use cases the state of the art for the generic problem may produce inferior results compared to a simpler approach which has been specifically trained for the problem at hand.
5 CONCLUSION
We introduced a real-time ConvNet based system for human pose estimation and achieved an accuracy of 96.8% (PCK@0.2) by finetuning the network for a specific use case. Our method can be thought of as a replacement for Kinect in restricted environments. It can be used in various tasks, like gesture control, gaming, person tracking, action recognition and action tracking. Our method supports heterogeneous training data, where the set of joints is not the same in all the training samples, thus enabling the utilization of different datasets in training. The use of a separate person detector brings our method towards the practice, where the person locations in the input images are not expected to be known. In addition, we demonstrated an automatic and easy way to create large amounts of annotated training data by using Kinect. The network forward time of our method is 16 ms without the person detector; with the person detector it is either 60+16=76 ms or 200+16=216 ms.
Table 4: Finetuning experiments. The training data have (1) different clothing from the testing data in every frame, (2) the same clothing as the testing data in every frame, (3) the same clothing as the testing data in some of the frames. In phase 2, the finetuning is done over the already finetuned network of phase 1. Otherwise it is done over the pre-trained network.

#   Name      Initialization network   Clothing in training frames
1   phase 1   pre-train                2, 3, 4, 5, 6, 7
2   phase 2   phase 1                  1
3   full      pre-train                1, 2, 3, 4, 5, 6, 7
Table 5: The results of pose estimation (PCK@0.2). The first three cases use pre-train validation samples (2000 images) in testing, while the other models use finetuning validation samples (4000 frames).

Network              Head   Wrist   Elbow   Shoulder   Hip    Knee   Ankle   All
Mean pose            31.1   18.9    8.5     11.8       10.0   40.8   33.5    21.4
Pre-train            84.2   41.6    60.5    76.9       72.8   62.6   53.7    63.1
Finetune (full)      77.5   22.2    42.9    49.8       52.5   42.6   38.6    44.2
Pre-train            86.1   24.5    64.1    86.8       88.0   82.5   64.0    69.6
Finetune (phase 1)   95.3   67.4    87.3    98.4       96.3   96.4   95.5    90.6
Finetune (phase 2)   99.6   88.1    95.6    99.9       97.3   98.5   98.5    96.6
Finetune (full)      99.3   89.2    95.9    99.7       97.6   98.5   98.6    96.8
Figure 4: The results of pose estimation (PCK@0.2), plotted as detection rate (%) against normalized distance (0 to 0.2) for the head, wrist, elbow, shoulder, hip, knee, ankle and all joints. The dashed lines use pre-train validation samples (2000 images) in testing, while the solid lines use finetuning validation samples (4000 frames). To put it another way, the dashed lines represent the accuracy of generalization, while the solid lines represent the use case specific accuracy. The labels (finetune full, finetune phase 2, finetune phase 1, pre-train, mean pose) indicate which network is used in testing.
As for future work, there are several things that could be considered in order to get better accuracy. One option would be to use the current network as a coarse estimator and another network for refining the pose estimation. In addition, as our method is targeted at video inputs, the utilization of spatiotemporal data would most likely give an accuracy boost. The network forward time of the person detector is relatively slow compared to the pose estimation network (16 ms vs. 60 ms/200 ms). While the person detector works well with diverse input data, with most pose estimation use cases that is perhaps not necessary. By using a more restricted and possibly faster person detector, good enough performance in more constrained environments could most likely be achieved. Also, with ConvNets it generally holds that the more data is used in training, the better the performance. Hence, the use of more advanced data augmentation methods, such as (Pishchulin et al., 2012), especially in the finetuning, would most probably lead to better accuracy. Advanced data augmentation could, for example, change the colors of the clothes, adjust limb poses and change backgrounds.
REFERENCES
Andriluka, M., Pishchulin, L., Gehler, P., and Schiele, B. (2014). 2d human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Andriluka, M., Roth, S., and Schiele, B. (2009). Pictorial structures revisited: People detection and articulated pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1014-1021.

Buehler, P., Everingham, M., Huttenlocher, D. P., and Zisserman, A. (2011). Upper body detection and tracking in extended signing sequences. International Journal of Computer Vision (IJCV), 95(2):180-197.

Carreira, J., Agrawal, P., Fragkiadaki, K., and Malik, J. (2015). Human pose estimation with iterative error feedback. arXiv preprint arXiv:1507.06550.

Charles, J., Pfister, T., Everingham, M., and Zisserman, A. (2013). Automatic and efficient human pose estimation for sign language videos. International Journal of Computer Vision (IJCV).

Charles, J., Pfister, T., Magee, D., Hogg, D., and Zisserman, A. (2016). Personalizing video pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Dantone, M., Gall, J., Leistner, C., and van Gool, L. (2013). Human pose estimation using body parts dependent joint regressors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3041-3048, Portland, OR, USA.

Felzenszwalb, P., McAllester, D., and Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-8.

Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics (AISTATS'10).

Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., and Schiele, B. (2016). Deepercut: A deeper, stronger, and faster multi-person pose estimation model. arXiv preprint arXiv:1605.03170.

Jain, A., Tompson, J., Andriluka, M., Taylor, G. W., and Bregler, C. (2013). Learning human pose estimation features with convolutional networks. arXiv preprint arXiv:1312.7302.

Jain, A., Tompson, J., LeCun, Y., and Bregler, C. (2014). Modeep: A deep learning framework using motion features for human pose estimation. In Asian Conference on Computer Vision (ACCV), pages 302-315.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.

Johnson, S. and Everingham, M. (2010). Clustered pose and nonlinear appearance models for human pose estimation. In British Machine Vision Conference (BMVC). doi:10.5244/C.24.12.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097-1105.

Lifshitz, I., Fetaya, E., and Ullman, S. (2016). Human pose estimation using deep consensus voting. arXiv preprint arXiv:1603.08212.

Newell, A., Yang, K., and Deng, J. (2016). Stacked hourglass networks for human pose estimation. arXiv preprint arXiv:1603.06937.

Pfister, T., Charles, J., and Zisserman, A. (2015). Flowing convnets for human pose estimation in videos. In International Conference on Computer Vision (ICCV).

Pfister, T., Simonyan, K., Charles, J., and Zisserman, A. (2014). Deep convolutional neural networks for efficient pose estimation in gesture videos. In Asian Conference on Computer Vision (ACCV).

Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P., and Schiele, B. (2015). Deepcut: Joint subset partition and labeling for multi person pose estimation. arXiv preprint arXiv:1511.06645.

Pishchulin, L., Jain, A., Andriluka, M., Thormählen, T., and Schiele, B. (2012). Articulated people detection and pose estimation: Reshaping the future. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3178-3185.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497.

Sapp, B. and Taskar, B. (2013). Modec: Multimodal decomposable models for human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3674-3681.

Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., Cook, M., and Moore, R. (2013). Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1):116-124.

Tompson, J., Goroshin, R., Jain, A., LeCun, Y., and Bregler, C. (2015). Efficient object localization using convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 648-656.

Toshev, A. and Szegedy, C. (2014). Deeppose: Human pose estimation via deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Wei, S.-E., Ramakrishna, V., Kanade, T., and Sheikh, Y. (2016). Convolutional pose machines. arXiv preprint arXiv:1602.00134.

Yang, Y. and Ramanan, D. (2011). Articulated pose estimation with flexible mixtures-of-parts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1385-1392.