Deep Learning for Facial Keypoints Detection

Mikko Haavisto, Arto Kaarna, Lasse Lensu


Deep learning, a new area of machine learning research, has moved the field closer to two of its original goals: artificial intelligence and feature learning. The original key idea for training deep networks was to pretrain models in a completely unsupervised way and then fine-tune the parameters for the task at hand using supervised learning. In this study, deep learning is applied to facial keypoints detection. The task is to predict the positions of 15 keypoints on grayscale face images. Each predicted keypoint is specified by a real-valued pair in the space of pixel coordinates. In the experiments, we pretrained a Deep Belief Network (DBN) and then performed discriminative fine-tuning. We varied the depth and size of the network, and we tested both deterministic and sampled hidden activations as well as the effect of additional unlabeled data on pretraining. The experimental results show that our model provides better results than the publicly available benchmarks for the dataset.
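
The pretraining stage described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes Bernoulli units trained with one-step contrastive divergence, toy 8x8 random "images", and arbitrary layer sizes and learning rate. All function names are hypothetical. The `sample_hidden` flag switches between deterministic and sampled hidden activations when passing data to the next layer, and the final linear layer maps the top features to 15 (x, y) pairs. Supervised fine-tuning by backpropagation is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm(data, n_hidden, rng, epochs=5, lr=0.05):
    """Train one Bernoulli RBM with CD-1 (full-batch updates, for brevity)."""
    n_samples, n_visible = data.shape
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step back to the visibles and up again.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        # CD-1 update: data statistics minus reconstruction statistics.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n_samples
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_hid


def pretrain_stack(data, layer_sizes, rng, sample_hidden=False):
    """Greedy layer-wise pretraining: each RBM trains on the layer below."""
    params = []
    layer_input = data
    for n_hidden in layer_sizes:
        W, b_hid = train_rbm(layer_input, n_hidden, rng)
        params.append((W, b_hid))
        h_prob = sigmoid(layer_input @ W + b_hid)
        if sample_hidden:
            # Sampled activations: pass binary states upward.
            layer_input = (rng.random(h_prob.shape) < h_prob).astype(float)
        else:
            # Deterministic activations: pass probabilities upward.
            layer_input = h_prob
    return params, layer_input


def predict(images, params, head):
    """Forward pass through the stack, then a linear regression output."""
    h = images
    for W, b in params:
        h = sigmoid(h @ W + b)
    W_out, b_out = head
    # 30 outputs reshaped to 15 keypoints with (x, y) coordinates each.
    return (h @ W_out + b_out).reshape(len(images), -1, 2)


rng = np.random.default_rng(0)
images = rng.random((32, 64))  # toy stand-in for flattened grayscale faces
params, features = pretrain_stack(images, [16, 8], rng)
head = (rng.normal(0.0, 0.01, size=(8, 30)), np.zeros(30))  # untrained head
preds = predict(images, params, head)
print(preds.shape)  # (32, 15, 2)
```

In a complete pipeline, the pretrained weights in `params` and the output layer in `head` would be trained jointly on labeled keypoint coordinates. Here they only show how the pieces connect.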



Paper Citation

in Harvard Style

Haavisto M., Kaarna A. and Lensu L. (2015). Deep Learning for Facial Keypoints Detection. In Proceedings of the 10th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2015), ISBN 978-989-758-090-1, pages 289-296. DOI: 10.5220/0005272202890296

in Bibtex Style

author={Mikko Haavisto and Arto Kaarna and Lasse Lensu},
title={Deep Learning for Facial Keypoints Detection},
booktitle={Proceedings of the 10th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2015)},

in EndNote Style

JO - Proceedings of the 10th International Conference on Computer Vision Theory and Applications - Volume 2: VISAPP, (VISIGRAPP 2015)
TI - Deep Learning for Facial Keypoints Detection
SN - 978-989-758-090-1
AU - Haavisto M.
AU - Kaarna A.
AU - Lensu L.
PY - 2015
SP - 289
EP - 296
DO - 10.5220/0005272202890296