Recovering 3D Human Poses and Camera Motions from Deep Sequence

Takashi Shimizu, Fumihiko Sakaue and Jun Sato

Department of Computer Science and Engineering, Nagoya Institute of Technology,

Gokiso, Showa, Nagoya 466-8555, Japan

Keywords:

Human Poses, Camera Motions, CNN, RNN, LSTM, Deep Learning.

Abstract:

In this paper, we propose a novel method for recovering 3D human poses and camera motions from sequential images by using CNN and LSTM. Human pose estimation based on deep learning has been studied extensively in recent years. However, most existing methods aim to classify 2D human motions in images. Although some methods have recently been proposed for recovering 3D human poses, they consider only single-frame poses and do not use the sequential properties of human actions efficiently. Furthermore, the existing methods recover only 3D poses relative to the viewpoint. In this paper, we propose a method for recovering 3D human poses and 3D camera motions simultaneously from sequential input images. In our network, CNN is combined with LSTM, so that the network can learn the sequential properties of 3D human poses and camera motions efficiently. The efficiency of the proposed method is evaluated by using real images as well as synthetic images.

1 INTRODUCTION

In recent years, human poses and actions have been measured and used in various applications, such as movies and games. Motion capture systems are often used for measuring human poses and actions (Lab, 2003; Shotton et al., 2011). While early motion capture systems (Lab, 2003) require special markers on the human body, recent systems such as Kinect sensors (Shotton et al., 2011) do not need markers. Although these motion capture systems are very useful for short-range measurements in well-maintained environments, they cannot be used for long-range measurements or in uncontrolled environments, such as outdoor scenes. In such situations, passive methods such as camera-based pose recognition are very useful.

For measuring human poses and actions from camera images, silhouette images were often used in order to neglect the texture of clothes etc. (Agarwal and Triggs, 2004; Sminchisescu and Telea, 2002). Shading information was also used for estimating 3D poses from a single view (Guan et al., 2009). More recently, deep learning has been used for pose estimation (Toshev and Szegedy, 2014). As shown in many recent papers, deep learning provides state-of-the-art accuracy in various fields (LeCun et al., 1989; LeCun et al., 1998; Le et al., 2011; Le, 2013; Taylor et al., 2010), and its use in human action recognition is promising. Although many neural nets have been proposed for recognizing 2D human poses and actions (Toshev and Szegedy, 2014), research on neural nets for 3D human pose recovery has just started (Chen and Ramanan, 2017; Tome et al., 2017; Lin et al., 2017; Mehta et al., 2017), and more work is required to obtain better accuracy and to handle various situations. In particular, most of the current works on 3D human pose recovery are based on a single image (Chen and Ramanan, 2017; Tome et al., 2017; Mehta et al., 2017). However, human poses are highly dependent in time, and the sequential properties may be very useful for recovering 3D poses and actions.

Thus, in this paper, we propose a novel method for recovering 3D human poses from images by using the sequential properties of 3D poses. For this objective, we combine the standard convolutional neural network (CNN) with Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). The LSTM can represent the sequential properties of 3D human poses, and hence our network can recover the 3D human pose at each time instant considering the sequence of human motions. As a result, our method can recover 3D human poses even if some body parts are occluded by other body parts.

Furthermore, our network considers not only 3D human poses, but also the 3D motions of the camera which observes the human. For separating 3D human motions and 3D camera motions, we fix the basis 3D coordinates at the waist of the human body, and describe both 3D human motions and 3D camera motions in these basis coordinates. By using our method, we can estimate 3D human poses and 3D camera motions simultaneously.

Shimizu, T., Sakaue, F. and Sato, J. Recovering 3D Human Poses and Camera Motions from Deep Sequence. DOI: 10.5220/0006718603930398. In Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2018) - Volume 5: VISAPP, pages 393-398. ISBN: 978-989-758-290-5

Figure 1: The structure and the state transition of LSTM. $x_t$ and $h_t$ denote input and output at time t. σ denotes a sigmoid function, which acts as a gate of data flow. By controlling these gates, the LSTM can preserve sequential information and learn time varying properties.

Figure 2: The outline of human pose and camera motion estimation (Input Image, Silhouette Image, Deep CNN, Pose and Motion).

In section 2, we briefly review the convolutional neural network (CNN) and the Long Short-Term Memory (LSTM). In section 3, we propose a method for estimating 3D human poses and camera motions by combining CNN with LSTM. The results of the proposed method are shown in section 4, and the conclusions are given in the final section.

2 CNN AND LSTM

While a fully connected neural network learns the weight of the connection between individual nodes in adjacent layers, the convolutional neural network (CNN) consists of convolution layers which connect adjacent layers by convolution, and is trained by optimizing the convolution kernels. As a result, CNN can optimize feature extraction from images, which was traditionally conducted by hand-crafted feature detectors such as SIFT and HOG. Nowadays, CNN is the de facto standard in image recognition and is used in various applications.

Although CNN is very useful and efficient in image recognition, the output of CNN is determined only from the current input image. As a result, it cannot process sequential data such as movies properly, since the desired output for sequential data depends not only on the current input, but also on the past inputs.

Figure 3: 3D human body model and the DOF of each joint.

For learning sequential data, the Recurrent Neural Network (RNN) has been proposed (Mikolov et al., 2010). The recurrent neural network preserves past sequential data as its internal state, and can process sequential data properly. For learning long-term dependencies in sequential data, Long Short-Term Memory (LSTM) has also been proposed (Hochreiter and Schmidhuber, 1997). While the original RNN can only learn short-term dependencies, LSTM can learn long-term properties of data.

Fig. 1 shows the network structure and the state transition of LSTM. The LSTM controls the learning process by using gates, σ. The input gate controls the input from the previous time, and the output gate controls the effect of the current layer on the next layer. The forget gate controls the discarding of data which are no longer needed. By controlling these gates, the LSTM can preserve sequential information and learn time-varying properties of the data efficiently.
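As a rough illustration of the gating described above, the following minimal sketch implements one time step of a single-unit LSTM in the commonly used formulation with input, forget and output gates. The weight names and values are hypothetical and are not taken from the network used in this paper.

```python
import math


def sigmoid(z):
    """Logistic function used for the gates (the sigma in Fig. 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell.

    w maps each gate name to a hypothetical triple
    (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x_t + wh * h_prev + b

    i_t = sigmoid(pre("input"))    # how much new information enters the cell
    f_t = sigmoid(pre("forget"))   # how much of the old cell state is kept
    o_t = sigmoid(pre("output"))   # how much of the cell state is exposed
    g_t = math.tanh(pre("cand"))   # candidate value for the cell state
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


# Illustrative weights only (assumption, not learned values).
weights = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "cand": (0.9, 0.3, 0.0),
}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5, 0.0]:
    h, c = lstm_step(x, h, c, weights)
```

With the forget gate close to 1 and the input gate close to 0, the cell state is carried over almost unchanged, which is how the cell can retain information over many time steps.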

3 HUMAN POSE AND CAMERA MOTION ESTIMATION FROM CNN AND LSTM

In this research, we combine CNN and LSTM for estimating 3D human poses and camera motions simultaneously. To avoid the effect of variations in background scenes, we first transform camera images into silhouette images of the human body and use the silhouette images as the input of our network, as shown in Fig. 2.

3.1 Representation of 3D Human Poses

In this research, 3D human poses are represented by a set of rotation angles at body joints. Suppose we have N joints in a human body. Then, since each joint has 3 rotation axes, the human pose can be represented by 3N rotation parameters.

VISAPP 2018 - International Conference on Computer Vision Theory and Applications


Figure 4: Proposed network structure for 3D human pose

and camera motion estimation.

However, the rotations of arms and legs around their own axes are irrelevant to human poses. Therefore, we consider that each of the upper arms, lower arms, upper legs and lower legs has only 2 DOF, and only the waist has 3 DOF. Thus, in this research, we represent the 3D human pose by using 19 parameters as shown in Fig. 3.
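The parameter count can be checked with a short sketch. Treating each limb segment as a left/right pair is an assumption made here; it is the reading that reproduces the 19 parameters stated in the text. The part names are hypothetical labels.

```python
# Degrees of freedom (DOF) per body part, following the text:
# 2 for each limb segment and 3 for the waist.
# Assumption: every limb segment exists once on the left and once on the right.
LIMB_SEGMENTS = ["upper_arm", "lower_arm", "upper_leg", "lower_leg"]
SIDES = ["left", "right"]

dof = {"waist": 3}
for side in SIDES:
    for segment in LIMB_SEGMENTS:
        dof[f"{side}_{segment}"] = 2

num_pose_params = sum(dof.values())
print(num_pose_params)  # 8 segments * 2 DOF + 3 DOF = 19
```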

The world coordinates are fixed at the waist of the human body, and the 3D positions and orientations of all the objects in the scene are represented in these waist-based world coordinates.

3.2 Representation of Camera Motions

In this research, we assume that not only the human body but also the camera which observes the human body moves during the sequential observations. Thus, we estimate camera positions as well as human poses at each time instant. We assume that the viewing direction of the camera is fixed to the center of the human body, i.e. the waist, and the camera position is represented by the orientation angles θ and φ and the distance d in the waist-based world coordinates. The camera can also rotate around the viewing axis by an angle ω. Thus, the camera position and orientation have 4 parameters.
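To make the 4-parameter camera description concrete, the sketch below converts (θ, φ, d, ω) into a camera position and an orthonormal camera frame. The angle convention is an assumption, since the text does not fix one: θ is taken as azimuth around the vertical z axis and φ as elevation, with the waist at the origin. The function name `camera_pose` is hypothetical.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def camera_pose(theta, phi, d, omega):
    """Return (position, right, up, forward) in waist-based coordinates.

    Angles are in radians. Assumed convention: theta is azimuth around z,
    phi is elevation. The frame is undefined for phi = +/- pi/2, where the
    viewing direction is parallel to the world up axis.
    """
    position = (d * math.cos(phi) * math.cos(theta),
                d * math.cos(phi) * math.sin(theta),
                d * math.sin(phi))
    # The viewing direction is fixed on the waist, i.e. the origin.
    forward = normalize(tuple(-c for c in position))
    right0 = normalize(cross(forward, (0.0, 0.0, 1.0)))
    up0 = cross(right0, forward)
    # Rotate the image axes by omega around the viewing axis.
    co, so = math.cos(omega), math.sin(omega)
    right = tuple(co * r + so * u for r, u in zip(right0, up0))
    up = tuple(-so * r + co * u for r, u in zip(right0, up0))
    return position, right, up, forward


position, right, up, forward = camera_pose(0.7, 0.3, 2.5, 1.1)
```

Because the viewing direction is tied to the waist, the three position parameters and the single roll angle are enough to determine the full camera pose.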

In this research, we estimate these 4 parameters of camera motion as well as the 19 parameters of human pose. Hence, we estimate 23 parameters in total.

3.3 Network Structure

We next describe the network structure of the proposed method. In this research, we combine CNN and LSTM for estimating 3D human poses and camera motions simultaneously, using the sequential properties of human motions and camera motions efficiently.

Suppose we have an input image $x_t$ from the camera at time t. Then our network estimates camera motion parameters $C_t$ and human pose parameters $P_t$ at time t from the input image $x_t$. Considering the sequential properties of human poses and camera motions, our network can be considered as a function F which estimates the current state of the network $S_t$ as

(a) input images (b) silhouette images

Figure 5: Examples of input images and silhouette images.

Figure 6: Changes in test loss in network training. The red line shows the loss of the proposed branch net which uses 2 separate LSTMs for human pose and camera motion, and the green line shows the loss of a straight net which uses a single LSTM for both human pose and camera motion.

well as the camera parameters $C_t$ and human pose parameters $P_t$ from the current input image $x_t$ and the previous state $S_{t-1}$ of the network as follows:

$$\{C_t, P_t, S_t\} = F(x_t, S_{t-1}) \qquad (1)$$

Thus, learning of the network is considered as the estimation of function F by regression analysis.

To realize this estimation, our network consists of 4 convolution layers, a pooling layer and 2 fully connected layers, followed by 2 sets of LSTMs and fully connected layers, as shown in Fig. 4. Our network first extracts image features by using the 4 convolution layers and the pooling layer. Then the 2 fully connected layers transform the result into a low-dimensional feature vector. The result is then passed separately to two different LSTMs, one for the estimation of human pose parameters and the other for the estimation of camera motion parameters. These LSTMs derive feature parameters of human pose and camera motion while updating their internal states. Finally, the last layers transform these feature parameters into 19 human pose parameters and 4 camera motion parameters.
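The branch structure can be sketched in plain Python as follows. This is only an illustration of the data flow, not the trained network: the convolution and fully connected layers are replaced by a placeholder feature vector, the feature and hidden sizes are hypothetical, the weights are random, and bias terms are omitted for brevity. Only the output sizes (19 and 4) come from the text.

```python
import math
import random

POSE_DIM, CAMERA_DIM = 19, 4   # output sizes stated in the text
FEAT_DIM, HIDDEN_DIM = 8, 6    # hypothetical sizes, not given in the paper


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class Branch:
    """One LSTM followed by a fully connected output layer."""

    def __init__(self, out_dim, rng):
        # One weight matrix per gate, applied to the concatenation [feature, h].
        self.gates = {name: rand_matrix(HIDDEN_DIM, FEAT_DIM + HIDDEN_DIM, rng)
                      for name in ("i", "f", "o", "g")}
        self.head = rand_matrix(out_dim, HIDDEN_DIM, rng)

    def step(self, feat, state):
        h_prev, c_prev = state
        z = {name: matvec(w, feat + h_prev) for name, w in self.gates.items()}
        c = [sigmoid(f) * cp + sigmoid(i) * math.tanh(g)
             for f, cp, i, g in zip(z["f"], c_prev, z["i"], z["g"])]
        h = [sigmoid(o) * math.tanh(ct) for o, ct in zip(z["o"], c)]
        return matvec(self.head, h), (h, c)


def zero_state():
    return ([0.0] * HIDDEN_DIM, [0.0] * HIDDEN_DIM)


def network_step(feat, state, pose_branch, camera_branch):
    """Plays the role of F in Eq. (1): (x_t, S_{t-1}) -> (C_t, P_t, S_t).

    `feat` stands in for the output of the CNN and fully connected layers.
    """
    pose, pose_state = pose_branch.step(feat, state[0])
    camera, camera_state = camera_branch.step(feat, state[1])
    return camera, pose, (pose_state, camera_state)


rng = random.Random(0)
pose_branch, camera_branch = Branch(POSE_DIM, rng), Branch(CAMERA_DIM, rng)
state = (zero_state(), zero_state())
for _ in range(10):  # 10 frames per sequence, as in the experiments
    feat = [rng.uniform(-1.0, 1.0) for _ in range(FEAT_DIM)]  # placeholder features
    camera, pose, state = network_step(feat, state, pose_branch, camera_branch)
```

Both branches read the same feature vector but keep separate internal states, so each branch carries only the temporal information of its own output.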

In this network, we consider that the transition of human pose and the transition of camera position are independent of each other, and estimate the human poses and camera motions by using 2 different LSTMs. By using the LSTMs, we can estimate 3D human poses efficiently even if some body parts are occluded by other body parts, which often happens in silhouette images. By training the network with back propagation, we realize the simultaneous estimation of human poses and camera motions.

Figure 7: The result of 3D human pose estimation. The estimated 3D human poses were reprojected into the original images by using the estimated camera motion.

3.4 Learning Network by using CG Models

We next consider the training of our network. To train the network while avoiding overfitting, we generally need a huge amount of training data. However, it is not easy to obtain a huge amount of image data of human poses under various camera motions in real scenes. Therefore, in this research we use synthetic images generated by using CG models.

We generated human models with various body shapes, and applied various pose parameters to them. We also generated a virtual camera with various motions, and observed the human poses to generate sequential CG images. For generating the human poses, we used the Mocap database (Lab, 2003) provided by Carnegie Mellon University. The Mocap database

Figure 8: The result of camera motion estimation. The red quadrangular pyramid shows the estimated camera positions and orientations, and the green quadrangular pyramid shows the ground truth.

Table 1: The error of 3D human pose estimation and camera motion estimation with and without LSTM.

               human pose (°)   camera (m)
with LSTM      11.8             2.6
without LSTM   18.2             5.7

consists of 2605 different motions, such as walking, dancing and playing sports. We used 2000 of them for training and the remaining 605 for testing in the synthetic image experiments. The virtual camera was moved around the human body with its viewing direction fixed on the center of the world coordinates, i.e. the center of the waist of the human body.

The use of synthetic images enables us to learn large variations of human pose parameters and camera motion parameters easily and efficiently. We can also simulate various types of human body, and control these parameters according to the objective of the application system. By using the synthetic training data, we train our network efficiently, and use it for estimating human poses and camera motions simultaneously.

4 EXPERIMENTS

We next show the results of simultaneous estimation of human poses and camera motions by using the proposed network. The experiments were conducted by using synthetic images as well as real images.

In our experiments, a 3D human body model


(a) batting

(b) exercise

Figure 9: The result of 3D human pose estimation. The estimated 3D human poses were reprojected into the original images

by using the estimated camera motions.

shown in Fig. 3 was used for synthesizing training images. As we explained in section 3.4, we used the Mocap database (Lab, 2003) for generating 3D human poses. The synthetic images were generated by changing human poses and camera motions. The image size was 100 × 100. Fig. 5 shows some examples of synthetic images and their silhouette images. We randomly generated 9000 sets of 10 sequential images from 2000 motions in the Mocap database, and used them for training our network. We also randomly generated 1200 sets of 10 sequential images from the remaining 605 motions in the Mocap database, and used them for testing in the synthetic image experiment. The network training was executed by using the Caffe framework.
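The data preparation described above can be sketched as follows. The counts are taken from the text, while the motion lengths, the random split and the use of consecutive frames within a window are assumptions made for illustration; rendering of the images is not shown.

```python
import random

NUM_MOTIONS, NUM_TRAIN_MOTIONS = 2605, 2000   # from the text
SEQ_LEN = 10                                  # images per set, from the text
NUM_TRAIN_SETS, NUM_TEST_SETS = 9000, 1200    # from the text

rng = random.Random(0)
# Hypothetical motion lengths in frames; real lengths come from the Mocap files.
motion_lengths = [rng.randint(50, 500) for _ in range(NUM_MOTIONS)]

# Split by motion, so that no motion contributes to both training and testing.
motion_ids = list(range(NUM_MOTIONS))
rng.shuffle(motion_ids)
train_ids = motion_ids[:NUM_TRAIN_MOTIONS]
test_ids = motion_ids[NUM_TRAIN_MOTIONS:]


def sample_windows(ids, count):
    """Draw `count` windows of SEQ_LEN consecutive frames from the given motions."""
    windows = []
    for _ in range(count):
        m = rng.choice(ids)
        start = rng.randint(0, motion_lengths[m] - SEQ_LEN)
        windows.append((m, start, start + SEQ_LEN))  # (motion, first, end)
    return windows


train_sets = sample_windows(train_ids, NUM_TRAIN_SETS)
test_sets = sample_windows(test_ids, NUM_TEST_SETS)
```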

We first evaluated the efficiency of our network structure, which uses 2 different LSTMs for human pose estimation and camera motion estimation. For comparison, we also evaluated a network which estimates human poses and camera motions by using a single LSTM at the middle of our network shown in Fig. 4. Fig. 6 shows the changes in test loss for these 2 networks. The red line shows the loss of the proposed branch net which uses 2 separate LSTMs for pose estimation and camera motion estimation, and the green line shows the loss of a straight net which uses a single LSTM for both pose estimation and camera motion estimation. As shown in this figure, the test loss of the proposed network decreases much faster than that of the straight net. This is because, by separating the pose net and the motion net, the proposed network can learn the pose and motion parameters more efficiently without learning irrelevant parameters.

We next show the results of 3D human pose estimation from synthetic images in Fig. 7. In this figure, the estimated 3D poses were reprojected into the original input images by using the estimated camera motions. As shown in these images, various poses were estimated well by the proposed network. Fig. 8 shows the camera motions estimated by the proposed network. The red quadrangular pyramid shows the estimated camera positions and orientations, and the green quadrangular pyramid shows the ground truth. As shown in this figure, the 3D camera motions were also estimated properly. The accuracy of the estimated 3D human poses and 3D camera positions is shown in Table 1. For comparison, we also evaluated the accuracy of a network without LSTM. As shown in this table, the proposed network with LSTM reduces the human pose error from 18.2° to 11.8° (about 35%) and the camera position error from 5.7 m to 2.6 m (about 54%), which indicates that the use of sequential properties of pose and motion is very important.

Finally, we show the results of 3D human pose estimation from real image sequences. Fig. 9 shows sequential images of a batting motion and an exercise motion, and the estimated 3D human poses projected into the images. The silhouette images were extracted by using the background subtraction method in these experiments. Although there are some estimation errors in the output of our network, the estimated results are reasonable.
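Silhouette extraction by background subtraction can be illustrated with the minimal sketch below. It shows simple per-pixel thresholding of the difference to a static background image; the exact variant used in the experiments is not specified in the text, and the function name and threshold value are hypothetical.

```python
def silhouette(frame, background, threshold=30):
    """Binary silhouette: 1 where the frame differs from the background.

    `frame` and `background` are grayscale images given as lists of rows
    with values in 0-255. The threshold is a hypothetical choice.
    """
    return [[1 if abs(p - b) > threshold else 0 for p, b in zip(row_f, row_b)]
            for row_f, row_b in zip(frame, background)]


# Toy example: a uniform background and a bright "person" two pixels tall.
background = [[100] * 5 for _ in range(4)]
frame = [row[:] for row in background]
frame[1][2] = frame[2][2] = 200
mask = silhouette(frame, background)
```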

These results show that the proposed method enables us to estimate sequential 3D human poses and camera motions properly.

5 CONCLUSION

In this paper, we proposed a novel method for recovering 3D human poses and camera motions from sequential images by using CNN and LSTM. While the existing methods recover just 3D poses relative to the viewpoint, our method estimates 3D human poses and 3D camera motions simultaneously. To use the sequential properties of human poses and camera motions, we combined CNN with LSTM, and showed that they can represent sequential properties in input data properly. We also showed that the network structure which uses 2 separate LSTMs for 3D pose estimation and camera motion estimation is efficient.

REFERENCES

Agarwal, A. and Triggs, B. (2004). 3D human pose from silhouettes by relevance vector regression. In Proc. CVPR.

Chen, C.-H. and Ramanan, D. (2017). 3D human pose estimation = 2D pose estimation + matching. In Proc. CVPR, pages 7035–7043.

Guan, P., Balan, A. W. A., and Black, M. (2009). Estimating human shape and pose from a single image. In Proc. ICCV.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Lab, C. G. (2003). Mocap: Motion capture database. In http://mocap.cs.cmu.edu/.

Le, Q. V. (2013). Building high-level features using large scale unsupervised learning. In Proc. International Conference on Acoustics, Speech and Signal Processing, pages 8595–8598.

Le, Q. V., Karpenko, A., Ngiam, J., and Ng, A. Y. (2011). ICA with reconstruction cost for efficient overcomplete feature learning. In Advances in Neural Information Processing Systems, pages 1017–1025.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Lin, M., Lin, L., Liang, X., Wang, K., and Cheng, H. (2017). Recurrent 3D pose sequence machines. In Proc. CVPR.

Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H., Shafiei, M., Seidel, H.-P., Xu, W., Casas, D., and Theobalt, C. (2017). VNect: Real-time 3D human pose estimation with a single RGB camera. In Proc. SIGGRAPH.

Mikolov, T., Karafiát, M., Burget, L., Cernocky, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In Proc. INTERSPEECH.

Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., and Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In Proc. CVPR.

Sminchisescu, C. and Telea, A. (2002). Human pose estimation from silhouettes: a consistent approach using distance level sets. In Proc. International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision.

Taylor, G., Fergus, R., LeCun, Y., and Bregler, C. (2010). Convolutional learning of spatio-temporal features. In Proc. ECCV, pages 140–153.

Tome, D., Russell, C., and Agapito, L. (2017). Lifting from the deep: Convolutional 3D pose estimation from a single image. In Proc. CVPR.

Toshev, A. and Szegedy, C. (2014). DeepPose: Human pose estimation via deep neural networks. In Proc. CVPR, pages 1653–1660.
