Accurate Synchronization of Gesture and Speech for Conversational Agents using Motion Graphs

Jianfeng Xu, Yuki Nagai, Shinya Takayama, Shigeyuki Sakazawa

2014

Abstract

Multimodal representation of conversational agents requires accurate synchronization of gesture and speech. To this end, we investigate, through a preliminary case study, the issues that matter most in synchronization as a practical guideline for our algorithm design, and propose a two-step synchronization approach. The case study reveals that two issues, duration and timing, play an important role when gesture is manually synchronized with speech. Treating synchronization as a motion synthesis problem rather than the behavior scheduling problem adopted by conventional methods, we first use a motion graph technique with constraints on gesture structure for coarse synchronization, and then refine the result by shifting and scaling the motion. This approach successfully synchronizes gesture and speech with respect to both duration and timing. We have confirmed that our system makes it easier to create attractive content than manual authoring of equal quality. In addition, a subjective evaluation demonstrates that the proposed approach achieves more accurate synchronization and higher motion quality than a state-of-the-art method.
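
To make the two-step idea concrete, the following Python sketch illustrates it under stated assumptions: a coarse duration-matching walk over a toy motion graph (step 1), followed by shifting and uniformly scaling the chosen clip so its gesture stroke lands on the stressed word (step 2). The data structures (MotionClip, the graph dictionary), the greedy search, and the warp limits are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the two-step synchronization idea described in the
# abstract. All names and the greedy search strategy are illustrative
# assumptions, not the method from the paper.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MotionClip:
    name: str
    duration: float      # clip length in seconds
    stroke_time: float   # time of the gesture stroke within the clip


def coarse_search(graph: Dict[str, List[str]],
                  clips: Dict[str, MotionClip],
                  start: str,
                  target_duration: float,
                  max_steps: int = 10) -> List[MotionClip]:
    """Step 1: greedily walk the motion graph, at each node picking the
    successor that keeps the accumulated duration closest to the target."""
    path = [clips[start]]
    total = clips[start].duration
    current = start
    for _ in range(max_steps):
        candidates = graph.get(current, [])
        if not candidates or total >= target_duration:
            break
        # choose the successor minimizing the remaining duration error
        current = min(candidates,
                      key=lambda c: abs(target_duration - (total + clips[c].duration)))
        path.append(clips[current])
        total += clips[current].duration
    return path


def refine(clip: MotionClip, phrase_start: float, stress_time: float):
    """Step 2: shift and uniformly scale one clip so its stroke coincides
    with the stressed word. Returns (playback_start, time_scale)."""
    scale = (stress_time - phrase_start) / max(clip.stroke_time, 1e-6)
    scale = min(max(scale, 0.8), 1.2)   # illustrative limit to keep the warp small
    start = stress_time - clip.stroke_time * scale
    return start, scale


if __name__ == "__main__":
    clips = {
        "beat_a": MotionClip("beat_a", 1.2, 0.5),
        "beat_b": MotionClip("beat_b", 0.8, 0.3),
        "rest":   MotionClip("rest",   0.6, 0.3),
    }
    graph = {"beat_a": ["beat_b", "rest"], "beat_b": ["rest"], "rest": []}
    path = coarse_search(graph, clips, "beat_a", target_duration=2.0)
    print([c.name for c in path])                      # coarse duration match
    print(refine(clips["beat_a"], 0.0, stress_time=0.6))  # timing refinement

In this toy example the coarse step selects a graph walk whose total duration approximates the speech phrase, and the refinement step computes a small shift and scale per clip; the real system applies the same idea with gesture-structure constraints on the graph search.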



Paper Citation


in Harvard Style

Xu J., Nagai Y., Takayama S. and Sakazawa S. (2014). Accurate Synchronization of Gesture and Speech for Conversational Agents using Motion Graphs. In Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-016-1, pages 5-14. DOI: 10.5220/0004748400050014


in Bibtex Style

@conference{icaart14,
author={Jianfeng Xu and Yuki Nagai and Shinya Takayama and Shigeyuki Sakazawa},
title={Accurate Synchronization of Gesture and Speech for Conversational Agents using Motion Graphs},
booktitle={Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2014},
pages={5-14},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004748400050014},
isbn={978-989-758-016-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - Accurate Synchronization of Gesture and Speech for Conversational Agents using Motion Graphs
SN - 978-989-758-016-1
AU - Xu J.
AU - Nagai Y.
AU - Takayama S.
AU - Sakazawa S.
PY - 2014
SP - 5
EP - 14
DO - 10.5220/0004748400050014