Abstract
Skill learning through reinforcement learning has progressed significantly in recent years. However, it often struggles to find optimal or near-optimal policies efficiently because of the trial-and-error nature of exploration in reinforcement learning. Although various algorithms have been proposed to improve skill learning, there is still considerable room for improvement in learning performance and training stability. In this paper, we propose skill enhancement learning with knowledge distillation (SELKD), an algorithm that integrates multiple actors and multiple critics for skill learning. SELKD employs knowledge distillation to establish a mutual learning mechanism among the actors. To mitigate the overestimation bias of the critics, we introduce a novel target value calculation method. We also provide a theoretical analysis of the convergence of SELKD. Finally, experiments on several continuous control tasks demonstrate the effectiveness of the proposed algorithm.
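The two ingredients named above, a multi-critic target value intended to temper overestimation and a knowledge-distillation term that lets actors learn from one another, can be sketched as follows. This is a minimal illustration only, not the paper's implementation: the critic and actor interfaces, the replay batch, and the specific ensemble target rule (here, the mean of the two smallest target-critic estimates) are assumptions made for exposition.

```python
# Minimal sketch (assumed interfaces, not SELKD's actual update rules) of
# (i) an ensemble TD target meant to reduce critic overestimation and
# (ii) a mutual-distillation loss between two actors.
import torch


def ensemble_target(target_critics, next_obs, next_act, reward, done,
                    gamma=0.99, alpha=0.2, next_logp=None):
    """TD target from an ensemble of target critics.

    Averaging the two smallest ensemble estimates is one plausible way to
    temper overestimation; the paper's actual target rule may differ.
    Each critic is assumed to map (obs, act) -> Q-values of shape [B, 1].
    """
    with torch.no_grad():
        qs = torch.stack([qc(next_obs, next_act) for qc in target_critics])  # [K, B, 1]
        two_smallest, _ = torch.topk(qs, k=2, dim=0, largest=False)
        q_next = two_smallest.mean(dim=0)                                    # [B, 1]
        if next_logp is not None:
            # Optional maximum-entropy (soft) correction, as in SAC-style methods.
            q_next = q_next - alpha * next_logp
        return reward + gamma * (1.0 - done) * q_next


def mutual_distillation_loss(actor, peer_actor, obs):
    """KL(actor || peer) on replay states: one way actors could teach each other.

    Assumes each actor exposes a hypothetical .dist(obs) method returning a
    torch.distributions object (e.g., a Normal over actions).
    """
    dist = actor.dist(obs)
    with torch.no_grad():
        peer_dist = peer_actor.dist(obs)
    return torch.distributions.kl_divergence(dist, peer_dist).mean()
```

In such a setup, each actor's policy loss would combine its usual objective with a small weight on the distillation term toward its peer; the precise losses and target calculation used by SELKD are those defined in the paper.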
Acknowledgements
This work was supported by the "New Generation Artificial Intelligence" Key Field Research and Development Plan of Guangdong Province (Grant No. 2021B0101410002), the National Science and Technology Major Project of the Ministry of Science and Technology of China (Grant No. 2018AAA0102900), and the National Natural Science Foundation of China (Grant Nos. U22A2057, 62133013).
Cite this article
Liu, N., Sun, F., Fang, B. et al. Skill enhancement learning with knowledge distillation. Sci. China Inf. Sci. 67, 182203 (2024). https://doi.org/10.1007/s11432-023-4016-0