Non-optimal Semi-autonomous Agent Behavior Policy Recognition

Mathieu Lelerre, Abdel-Illah Mouaddib

Abstract

The coordination of cooperative autonomous agents is mainly based on knowing or estimating the behavior policy of each other agent. Most approaches assume that agents estimate the policies of the others to be the optimal ones. Unfortunately, this assumption does not hold for the coordination of semi-autonomous agents, where an external entity can act to change an agent's behavior in a non-optimal way. We face such problems when the external entity is an operator guiding or tele-operating a system, since many factors can affect the operator's behavior, such as stress, hesitation, or personal preferences. In such situations, recognizing the other agents' policies becomes harder than usual, since considering all possible cases of hesitation or stress is not feasible. In this paper, we propose an approach, based on online learning techniques, that recognizes and predicts the future actions and behavior of such agents when they may follow any policy, including non-optimal ones shaped by hesitations and preferences. The main idea of our approach is to initially estimate the policy as the optimal one, and then update this estimate according to the observed behavior to derive a new estimated policy. We present three learning methods for updating policies, show their stability and efficiency, and compare them with existing approaches.
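The core idea described above (initialize the estimated policy with the assumed-optimal one, then update it from observed behavior) can be illustrated with a minimal, hypothetical sketch. This is not the authors' algorithm; the pseudo-count prior and maximum-likelihood prediction are illustrative assumptions.

```python
from collections import defaultdict


class PolicyEstimator:
    """Illustrative sketch (not the paper's method): start from an
    assumed-optimal policy and refine it from observed actions."""

    def __init__(self, optimal_policy, prior_weight=2.0):
        # optimal_policy: dict mapping state -> action assumed optimal.
        # prior_weight is a pseudo-count encoding initial trust in optimality.
        self.counts = defaultdict(lambda: defaultdict(float))
        for state, action in optimal_policy.items():
            self.counts[state][action] = prior_weight

    def observe(self, state, action):
        # Each observed (state, action) pair shifts the estimate toward
        # the agent's actual, possibly non-optimal, behavior.
        self.counts[state][action] += 1.0

    def predict(self, state):
        # Most likely action under the current estimate, or None if
        # the state has never been seen and has no prior.
        actions = self.counts[state]
        return max(actions, key=actions.get) if actions else None
```

A short usage example: with a prior of 2 on the optimal action, three observations of a different action are enough to flip the prediction, mimicking how repeated hesitations would override the optimality assumption.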

References

  1. Mouaddib, A.-I., Jeanpierre, L., and Zilberstein, S. (2015). Handling advice in MDPs for semi-autonomous systems. In ICAPS Workshop on Planning and Robotics (PlanRob), pages 153-160.
  2. He, H., Eisner, J., and Daumé III, H. (2012). Imitation learning by coaching. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K., editors, Advances in Neural Information Processing Systems 25, pages 3149-3157. Curran Associates, Inc.
  3. Hüttenrauch, H. and Severinson Eklundh, K. (2006). Beyond usability evaluation: Analysis of human-robot interaction at a major robotics competition. Interaction Studies, 7(3):455-477.
  4. Knox, W. and Stone, P. (2008). TAMER: Training an agent manually via evaluative reinforcement. In 7th IEEE International Conference on Development and Learning (ICDL 2008), pages 292-297.
  5. Knox, W. B. and Stone, P. (2010). Combining manual feedback with subsequent MDP reward signals for reinforcement learning. In Proc. of 9th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2010).
  6. Knox, W. B. and Stone, P. (2012). Reinforcement learning with human and MDP reward. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012).
  7. Monderer, D. and Shapley, L. S. (1996). Potential games. Games and Economic Behavior, 14(1):124-143.
  8. Nair, R., Tambe, M., Yokoo, M., Pynadath, D. V., and Marsella, S. (2003). Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9-15, 2003, pages 705-711.
  9. Panagou, D. and Kumar, V. (2014). Cooperative visibility maintenance for leader-follower formations in obstacle environments. IEEE Transactions on Robotics, 30(4):831-844.
  10. Paruchuri, P., Pearce, J. P., Marecki, J., Tambe, M., Ordonez, F., and Kraus, S. (2008). Playing games for security: An efficient exact algorithm for solving Bayesian Stackelberg games. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 2, AAMAS '08, pages 895-902, Richland, SC. International Foundation for Autonomous Agents and Multiagent Systems.
  11. Pashenkova, E., Rish, I., and Dechter, R. (1996). Value iteration and policy iteration algorithms for Markov decision problems. In AAAI'96: Workshop on Structural Issues in Planning and Temporal Reasoning. Citeseer.
  12. Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition.
  13. Shiomi, M., Sakamoto, D., Kanda, T., Ishi, C. T., Ishiguro, H., and Hagita, N. (2008). A semi-autonomous communication robot: a field trial at a train station. In Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction, pages 303-310, New York, NY, USA. ACM.
  14. Sigaud, O. and Buffet, O. (2010). Markov Decision Processes in Artificial Intelligence. Wiley-ISTE.
  15. Sutton, R. S. and Barto, A. G. (1998). Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA.
  16. Vorobeychik, Y., An, B., and Tambe, M. (2012). Adversarial patrolling games. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 3, AAMAS '12, pages 1307-1308, Richland, SC. International Foundation for Autonomous Agents and Multiagent Systems.
  17. Watkins, C. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279-292.


Paper Citation


in Harvard Style

Lelerre M. and Mouaddib A. (2016). Non-optimal Semi-autonomous Agent Behavior Policy Recognition. In Proceedings of the 8th International Joint Conference on Computational Intelligence - Volume 1: ECTA, (IJCCI 2016) ISBN 978-989-758-201-1, pages 193-200. DOI: 10.5220/0006054401930200


in Bibtex Style

@conference{ecta16,
author={Mathieu Lelerre and Abdel-Illah Mouaddib},
title={Non-optimal Semi-autonomous Agent Behavior Policy Recognition},
booktitle={Proceedings of the 8th International Joint Conference on Computational Intelligence - Volume 1: ECTA, (IJCCI 2016)},
year={2016},
pages={193-200},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006054401930200},
isbn={978-989-758-201-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Joint Conference on Computational Intelligence - Volume 1: ECTA, (IJCCI 2016)
TI - Non-optimal Semi-autonomous Agent Behavior Policy Recognition
SN - 978-989-758-201-1
AU - Lelerre M.
AU - Mouaddib A.
PY - 2016
SP - 193
EP - 200
DO - 10.5220/0006054401930200