# EXPLOITING SIMILARITY INFORMATION IN REINFORCEMENT LEARNING - Similarity Models for Multi-Armed Bandits and MDPs

### Ronald Ortner

#### Abstract

This paper considers reinforcement learning problems with additional similarity information. We start with the simple setting of multi-armed bandits in which the learner knows the color of each arm, and arms of the same color are assumed to have close mean rewards. We present an algorithm showing that this color information can be used to improve the dependence of online regret bounds on the number of arms. Further, we discuss to what extent this approach can be extended to the more general case of Markov decision processes. For the simplest case, where same-colored actions have similar rewards and identical transition probabilities, an algorithm and a corresponding online regret bound are given. For the general case, where same-colored actions have close but not necessarily identical transition probabilities, we give upper and lower bounds on the error incurred by aggregating actions according to the color information. These bounds also indicate that the general case is far more difficult to handle.
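To make the colored-bandit setting concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes Bernoulli rewards and applies the standard UCB1 index to colors instead of individual arms, pulling a random arm of the chosen color. The function name `ucb1_over_colors`, its parameters, and the example means are hypothetical and chosen only for illustration.

```python
import math
import random


def ucb1_over_colors(means, colors, horizon, seed=0):
    """Naive illustration: run UCB1 with each color treated as one aggregated arm.

    means   -- assumed Bernoulli mean reward of each arm (unknown to the learner)
    colors  -- color label of each arm (known to the learner)
    horizon -- number of rounds to play
    """
    rng = random.Random(seed)

    # Group arm indices by their color.
    groups = {}
    for arm, color in enumerate(colors):
        groups.setdefault(color, []).append(arm)
    color_list = sorted(groups)

    pulls = {c: 0 for c in color_list}
    reward_sum = {c: 0.0 for c in color_list}
    total_reward = 0.0

    for t in range(1, horizon + 1):
        untried = [c for c in color_list if pulls[c] == 0]
        if untried:
            # Play every color once before using the index.
            color = untried[0]
        else:
            # UCB1 index: empirical mean plus confidence width, per color.
            color = max(
                color_list,
                key=lambda c: reward_sum[c] / pulls[c]
                + math.sqrt(2.0 * math.log(t) / pulls[c]),
            )
        # Same-colored arms are assumed to be close, so any arm of the color will do.
        arm = rng.choice(groups[color])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[color] += 1
        reward_sum[color] += reward
        total_reward += reward

    return pulls, total_reward


if __name__ == "__main__":
    # Six arms, two colors; means within a color differ by at most 0.02.
    means = [0.20, 0.22, 0.21, 0.80, 0.82, 0.81]
    colors = ["red"] * 3 + ["blue"] * 3
    pulls, total = ucb1_over_colors(means, colors, horizon=2000)
    print(pulls, total)
```

In this sketch the learner keeps statistics for two colors rather than six arms, which is the intuition behind a regret bound that depends on the number of colors. The price is that it cannot distinguish arms within a color, so it is only reasonable when same-colored means are close.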

#### Paper Citation

#### in Harvard Style

Ortner R. (2010). **EXPLOITING SIMILARITY INFORMATION IN REINFORCEMENT LEARNING - Similarity Models for Multi-Armed Bandits and MDPs**. In *Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART*, ISBN 978-989-674-021-4, pages 203-210. DOI: 10.5220/0002703002030210

#### in Bibtex Style

@conference{icaart10,

author={Ronald Ortner},

title={EXPLOITING SIMILARITY INFORMATION IN REINFORCEMENT LEARNING - Similarity Models for Multi-Armed Bandits and MDPs},

booktitle={Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},

year={2010},

pages={203-210},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0002703002030210},

isbn={978-989-674-021-4},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the 2nd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART

TI - EXPLOITING SIMILARITY INFORMATION IN REINFORCEMENT LEARNING - Similarity Models for Multi-Armed Bandits and MDPs

SN - 978-989-674-021-4

AU - Ortner R.

PY - 2010

SP - 203

EP - 210

DO - 10.5220/0002703002030210