Finite-time Analysis of the Multiarmed Bandit Problem

  • Published: May 2002
  • Volume 47, pages 235–256 (2002)
  • Peter Auer¹,
  • Nicolò Cesa-Bianchi² &
  • Paul Fischer³
Abstract

Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions and exploiting the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
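
To make the abstract's claim concrete, here is a minimal sketch of a UCB1-style index policy of the kind analyzed in this paper, assuming rewards are normalized to [0, 1]: each arm is played once, and thereafter the policy plays the arm maximizing its empirical mean reward plus the exploration bonus sqrt(2 ln n / n_j), where n is the total number of plays so far and n_j is the number of plays of arm j. The reward callback pull, the Bernoulli arm probabilities in the usage example, and all variable names are illustrative assumptions, not taken from the article.

import math
import random

def ucb1(pull, n_arms, horizon):
    """Sketch of a UCB1-style policy; pull(arm) returns one reward in [0, 1]."""
    counts = [0] * n_arms      # n_j: number of times each arm has been played
    means = [0.0] * n_arms     # empirical mean reward of each arm
    total_reward = 0.0

    # Initialization: play each arm once.
    for arm in range(n_arms):
        r = pull(arm)
        counts[arm], means[arm] = 1, r
        total_reward += r

    # Main loop: play the arm with the largest upper confidence index.
    for n in range(n_arms, horizon):
        index = [means[j] + math.sqrt(2.0 * math.log(n) / counts[j])
                 for j in range(n_arms)]
        arm = max(range(n_arms), key=lambda j: index[j])
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
        total_reward += r

    return total_reward

# Usage: three Bernoulli arms with hypothetical success probabilities.
if __name__ == "__main__":
    probs = [0.3, 0.5, 0.7]
    reward = ucb1(lambda a: float(random.random() < probs[a]),
                  n_arms=len(probs), horizon=10000)
    print(reward)  # realized regret is roughly 10000 * 0.7 - reward

The exploration bonus shrinks as an arm accumulates plays, so suboptimal arms are sampled only about logarithmically often; this is the mechanism behind the paper's uniform-over-time logarithmic regret guarantee.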

References

  • Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27, 1054–1078.
  • Berry, D., & Fristedt, B. (1985). Bandit problems. London: Chapman and Hall.
  • Burnetas, A., & Katehakis, M. (1996). Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17:2, 122–142.
  • Duff, M. (1995). Q-learning for bandit problems. In Proceedings of the 12th International Conference on Machine Learning (pp. 209–217).
  • Gittins, J. (1989). Multi-armed bandit allocation indices. Wiley-Interscience Series in Systems and Optimization. New York: John Wiley and Sons.
  • Holland, J. (1992). Adaptation in natural and artificial systems. Cambridge: MIT Press/Bradford Books.
  • Ishikida, T., & Varaiya, P. (1994). Multi-armed bandit problem revisited. Journal of Optimization Theory and Applications, 83:1, 113–154.
  • Lai, T., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 4–22.
  • Pollard, D. (1984). Convergence of stochastic processes. Berlin: Springer.
  • Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press/Bradford Books.
  • Wilks, S. (1962). Mathematical statistics. New York: John Wiley and Sons.
  • Yakowitz, S., & Lowe, W. (1991). Nonparametric bandit methods. Annals of Operations Research, 28, 297–312.

Author information

Authors and Affiliations

  1. University of Technology Graz, A-8010, Graz, Austria

    Peter Auer

  2. DTI, University of Milan, via Bramante 65, I-26013, Crema, Italy

    Nicolò Cesa-Bianchi

  3. Lehrstuhl Informatik II, Universität Dortmund, D-44221, Dortmund, Germany

    Paul Fischer

About this article

Cite this article

Auer, P., Cesa-Bianchi, N. & Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47, 235–256 (2002). https://doi.org/10.1023/A:1013689704352

  • Issue date: May 2002

  • DOI: https://doi.org/10.1023/A:1013689704352

Keywords

  • bandit problems
  • adaptive allocation rules
  • finite horizon regret