Finite-time Analysis of the Multiarmed Bandit Problem

  • Published: May 2002
  • Volume 47, pages 235–256 (2002)
  • Peter Auer¹,
  • Nicolò Cesa-Bianchi² &
  • Paul Fischer³
Abstract

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions and exploiting the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
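
To make the abstract's claim concrete, here is a minimal sketch of a UCB1-style index policy of the kind analyzed in this paper, assuming rewards are normalized to [0, 1]: each arm is played once, and thereafter the policy plays the arm maximizing its empirical mean reward plus the exploration bonus sqrt(2 ln n / n_j), where n is the total number of plays so far and n_j is the number of plays of arm j. The reward callback pull, the Bernoulli arm probabilities in the usage example, and all variable names are illustrative assumptions, not taken from the article.

import math
import random

def ucb1(pull, n_arms, horizon):
    """Sketch of a UCB1-style policy; pull(arm) returns one reward in [0, 1]."""
    counts = [0] * n_arms      # n_j: number of times each arm has been played
    means = [0.0] * n_arms     # empirical mean reward of each arm
    total_reward = 0.0

    # Initialization: play each arm once.
    for arm in range(n_arms):
        r = pull(arm)
        counts[arm], means[arm] = 1, r
        total_reward += r

    # Main loop: play the arm with the largest upper confidence index.
    for n in range(n_arms, horizon):
        index = [means[j] + math.sqrt(2.0 * math.log(n) / counts[j])
                 for j in range(n_arms)]
        arm = max(range(n_arms), key=lambda j: index[j])
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
        total_reward += r

    return total_reward

# Usage: three Bernoulli arms with hypothetical success probabilities.
if __name__ == "__main__":
    probs = [0.3, 0.5, 0.7]
    reward = ucb1(lambda a: float(random.random() < probs[a]),
                  n_arms=len(probs), horizon=10000)
    print(reward)  # realized regret is roughly 10000 * 0.7 - reward

The exploration bonus shrinks as an arm accumulates plays, so suboptimal arms are sampled only about logarithmically often; this is the mechanism behind the paper's uniform-over-time logarithmic regret guarantee.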

References

  • Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.
  • Berry, D., & Fristedt, B. (1985). Bandit problems. London: Chapman and Hall.
  • Burnetas, A., & Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17:2, 122–142.
  • Duff, M. (1995). Q-learning for bandit problems. In Proceedings of the 12th International Conference on Machine Learning (pp. 209–217).
  • Gittins, J. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. New York: John Wiley and Sons.
  • Holland, J. (1992). Adaptation in natural and artificial systems. Cambridge: MIT Press/Bradford Books.
  • Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83:1, 113–154.
  • Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
  • Pollard, D. (1984). Convergence of stochastic processes. Berlin: Springer.
  • Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press/Bradford Books.
  • Wilks, S. (1962). Mathematical statistics. New York: John Wiley and Sons.
  • Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297–312.

Author information

Authors and Affiliations

  1. University of Technology Graz, A-8010, Graz, Austria

    Peter Auer

  2. DTI, University of Milan, via Bramante 65, I-26013, Crema, Italy

    Nicolò Cesa-Bianchi

  3. Lehrstuhl Informatik II, Universität Dortmund, D-44221, Dortmund, Germany

    Paul Fischer

About this article

Cite this article

Auer, P., Cesa-Bianchi, N. & Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47, 235–256 (2002). https://doi.org/10.1023/A:1013689704352

  • Issue date: May 2002

  • DOI: https://doi.org/10.1023/A:1013689704352

Keywords

  • bandit problems
  • adaptive allocation rules
  • finite horizon regret