ses and 3D camera motions simultaneously. To exploit
the sequential properties of human poses and camera
motions, we combined a CNN with LSTMs and showed
that this combination can properly represent the
sequential properties of the input data. We also showed
that a network structure with two separate LSTMs, one for
3D pose estimation and one for camera motion estimation,
is efficient.