Plans of releasing mujoco benchmark with ddpg/sac/td3 on Tianshou #274

@ChenDRAG

Description

Purpose

The purpose of this issue (discussion) is to introduce a series of PRs in the near future targeted at releasing a benchmark (SAC, TD3, DDPG) on MuJoCo environments. Some features of the Tianshou platform will be enhanced along the way.

Introduction

By the time this issue is proposed, Tianshou has attracted 2.4k stars on GitHub and has become a very popular deep RL library based purely on PyTorch (in contrast with OpenAI Baselines, rllab, etc.), thanks to the contributions of @Trinkle23897, @duburcqa, @youkaichao, and others. However, as the number of users grows day by day, some problems have started to surface. One critical problem is that although Tianshou is a fast, well-structured, flexible library that officially supports many classic algorithms, it has done a relatively poor job of benchmarking the algorithms it supports. Examples and demonstrations are mostly tested on toy Gym environments, and we have not yet provided detailed comparisons and analyses against classic papers for the officially supported algorithms. This might make users worry about the correctness and efficiency of the algorithms, and it makes it harder for researchers using Tianshou to reproduce the results of classic papers, because of the lack of trustworthy hyperparameters (a baseline, in other words).

Tianshou hopes to provide users with a lightweight and efficient DRL platform and to reduce the burden on RL researchers as much as possible. Even if users are beginners who are not yet familiar with DRL algorithms or baselines, they should be able to design their own algorithm with minimal lines of code by inheriting from and using the official data/algorithm structures, understand the source code, and compare their ideas with standard algorithms easily. To achieve this, one thing we have to do is provide a detailed benchmark for widely used algorithms and environments.
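
As a rough illustration of that inheritance-based workflow, here is a minimal sketch of a custom policy. It assumes the current BasePolicy interface (a forward method mapping a batch of observations to actions, and a learn method updating from a sampled batch); the random policy itself is purely illustrative and the exact signatures may differ between Tianshou versions.

```python
import numpy as np
from tianshou.data import Batch
from tianshou.policy import BasePolicy


class RandomPolicy(BasePolicy):
    """Toy policy acting uniformly at random, showing the two methods a
    custom algorithm has to provide when inheriting from BasePolicy."""

    def __init__(self, action_shape):
        super().__init__()
        self.action_shape = action_shape

    def forward(self, batch, state=None, **kwargs):
        # Map a batch of observations to a batch of actions.
        act = np.random.uniform(-1.0, 1.0, (len(batch.obs), *self.action_shape))
        return Batch(act=act, state=state)

    def learn(self, batch, **kwargs):
        # Update the networks from a sampled batch; return scalars for logging.
        return {"loss": 0.0}
```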

This is what I have been trying to do, and the first step has been taken. Using Tianshou, I have managed to create a state-of-the-art benchmark for three algorithms on 9 of the 14 most widely used MuJoCo environments.

DDPG

| Environment | Tianshou | Spinning Up (PyTorch) | TD3 paper (DDPG) | TD3 paper (our DDPG) |
| --- | --- | --- | --- | --- |
| Ant | 990.4±4.3 | ~840 | 1005.3 | 888.8 |
| HalfCheetah | 11718.7±465.6 | ~11000 | 3305.6 | 8577.3 |
| Hopper | 2197.0±971.6 | ~1800 | 2020.5 | 1860.0 |
| Walker2d | 1400.6±905.0 | ~1950 | 1843.6 | 3098.1 |
| Swimmer | 144.1±6.5 | ~137 | N | N |
| Humanoid | 177.3±77.6 | N | N | N |
| Reacher | -3.3±0.3 | N | -6.51 | -4.01 |
| InvertedPendulum | 1000.0±0.0 | N | 1000.0 | 1000.0 |
| InvertedDoublePendulum | 8364.3±2778.9 | N | 9355.5 | 8370.0 |

TD3

| Environment | Tianshou | Spinning Up (PyTorch) | TD3 paper |
| --- | --- | --- | --- |
| Ant | 5116.4±799.9 | ~3800 | 4372.4±1000.3 |
| HalfCheetah | 10201.2±772.8 | ~9750 | 9637.0±859.1 |
| Hopper | 3472.2±116.8 | ~2860 | 3564.1±114.7 |
| Walker2d | 3982.4±274.5 | ~4000 | 4682.8±539.6 |
| Swimmer | 104.2±34.2 | ~78 | N |
| Humanoid | 5189.5±178.5 | N | N |
| Reacher | -2.7±0.2 | N | -3.6±0.6 |
| InvertedPendulum | 1000.0±0.0 | N | 1000.0±0.0 |
| InvertedDoublePendulum | 9349.2±14.3 | N | 9337.5±15.0 |

SAC

| Environment | Tianshou | Spinning Up (PyTorch) | SAC paper |
| --- | --- | --- | --- |
| Ant | 5850.2±475.7 | ~3980 | ~3720 |
| HalfCheetah | 12138.8±1049.3 | ~11520 | ~10400 |
| Hopper | 3542.2±51.5 | ~3150 | ~3370 |
| Walker2d | 5007.0±251.5 | ~4250 | ~3740 |
| Swimmer | 44.4±0.5 | ~41.7 | N |
| Humanoid | 5488.5±81.2 | N | ~5200 |
| Reacher | -2.6±0.2 | N | N |
| InvertedPendulum | 1000.0±0.0 | N | N |
| InvertedDoublePendulum | 9359.5±0.4 | N | N |

* Reward metric: the table value is the max average return over 10 trials (different seeds) ± one standard deviation over trials. Each trial is itself averaged over another 10 test seeds. Only the first 1M steps of data are considered. The shaded region in each graph also represents one standard deviation. (Note that in the TD3 paper the shaded region represents only half a standard deviation.) A small sketch of this computation follows these notes.

** ~ means the number is approximated from the graph because accurate numbers are not provided in the paper. N means no graph is provided.

*** We used the latest version of all MuJoCo environments in Gym (0.17.3), which is not always the case in other papers. Please check the original papers for details. (Outcomes across versions are usually similar, though.)

**** We did not compare with OpenAI Baselines because, for now, I think its benchmark is corrupted(?) and I have not been able to find the information I need. However, the Spinning Up docs state that "Spinning Up implementations of DDPG, TD3, and SAC are roughly at-parity with the best-reported results for these algorithms", so I think the lack of comparisons with OpenAI Baselines is acceptable.
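
To make the reward metric in note * concrete, here is a minimal NumPy sketch under one plausible reading of the metric; the array and its shape are placeholders, not the actual benchmark data.

```python
import numpy as np

# Hypothetical array: returns[trial, checkpoint] is the test return of one
# training trial at one evaluation checkpoint (itself averaged over 10 test
# seeds), restricted to the first 1M environment steps.
returns = np.random.rand(10, 100) * 3000  # placeholder data: 10 trials

mean_curve = returns.mean(axis=0)   # average over trials at each checkpoint
best = mean_curve.argmax()          # checkpoint with the highest average
value = mean_curve[best]            # number reported in the table
spread = returns[:, best].std()     # the "±" part: one std over trials
print(f"{value:.1f} ± {spread:.1f}")
```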

[figure]

I only show one figure here as an example; all the other figures for the Tianshou MuJoCo benchmark can be found here.

Achieving these results is not easy, because it requires not only hyperparameter tuning but also changes to several features of the Tianshou platform, most of which have already been raised by different users in existing issues.

There are also problems that existing issues have not mentioned, or that I had not noticed before. For instance:

  • In the trainer, log_interval for the update step and the env step can only be the same, which causes inconvenience. A flexible logger would help.
  • In the net utils, the Net helper can only create MLPs in which all hidden layers have the same width (see the sketch after this list).
  • In the policies, some policies add exploration noise even when evaluating the algorithm.
  • The Buffer and Collector currently in Tianshou are a little too complex, because they try to support all features in a single class. This causes great inconvenience when trying to understand the source code or when inheriting from those classes to create customized data structures.
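
To illustrate the second limitation above, here is a plain PyTorch sketch of the kind of constructor a more flexible Net would allow; `build_mlp` is a hypothetical helper, not Tianshou's actual Net API.

```python
import torch.nn as nn


def build_mlp(input_dim, output_dim, hidden_sizes):
    """Hypothetical helper: build an MLP whose hidden layers may all have
    different widths, e.g. hidden_sizes=(256, 128, 64)."""
    layers, last = [], input_dim
    for size in hidden_sizes:
        layers += [nn.Linear(last, size), nn.ReLU(inplace=True)]
        last = size
    layers.append(nn.Linear(last, output_dim))
    return nn.Sequential(*layers)


# e.g. an actor body for a task with 17-dim observations and 6-dim actions
net = build_mlp(17, 6, hidden_sizes=(256, 128, 64))
```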

All of the problems above will be addressed to a certain extent while releasing the benchmark. The scripts that produce this benchmark are hosted on my fork of Tianshou and can be found here. However, they cannot be merged directly, because they are only what we used to demonstrate the idea and are not well organized (they lack consistency, docs, comments, tests, etc.). Another reason is that this will be a big merge into Tianshou, and we want to enhance Tianshou without causing too much interference for our users. As a result, I have made a plan and hope to merge all the code in 6 commits in total over the next few weeks. All of these commits ultimately target releasing the benchmark above.

Plans

Here I briefly introduce what these 6 commits try to do.

  1. In the net utils, enhance the Net helper to support MLPs with arbitrary hidden layer sizes.
  • This is the most urgent commit, because the Net helper will be needed in another PR.
  2. Minor fixes to Batch, plus a new ReplayBuffer subclass called CachedReplayBuffer.
  • CachedReplayBuffer is used to replace _cached_buf of the Collector in the next commit, which is critical to solving the n_step problem mentioned in Traditional step collector implementation is needed #245.
  • Change the definition of ReplayBuffer to a certain management of Batch, because a chronologically organized ReplayBuffer might not be suitable for all scenarios.
  • Give all buffer types inheriting from ReplayBuffer the same API (the indexing method, for instance), so that developers, not users, worry about the underlying implementation of the different ReplayBuffer types.
  • [Probably] Separate the stack option from the other abilities of ReplayBuffer, to make the source code easier to understand or rewrite, and to gain efficiency at the same time.
  • Docs, tests, etc.
  3. Refactor the Collector to support both ReplayBuffer and CachedReplayBuffer.
  • Fix Traditional step collector implementation is needed #245 by supporting CachedReplayBuffer and not allowing a plain ReplayBuffer to work when n_env > 1.
  • Remove the rarely used return info to make the code more lightweight.
  • Change BasePolicy to prepare for the incoming change of the indexing method of CachedReplayBuffer.
  • Fix a bug in BasePolicy: when ignoring done and setting n_step > 1 in off-policy algorithms, a small number of target Q values are computed incorrectly.
  • Change the behavior of action noise: exploration noise will now always be added in the Collector, making it easier to redefine and less likely to cause bugs than adding it in the forward function. Partly solves Noisy network implementation #194.
  • A small change in the trainer, to coordinate with the Collector's change.
  • Docs, tests, etc.
  4. Refactor the trainer to add a self-defined logger.
  • Add a logger to the trainer which can be self-defined and will be used for benchmarking (a hypothetical interface sketch follows this list).
  • Remove the original log_interval, save_fn, writer, etc. (all logging functionality).
  • Add a default logger which basically does all the jobs of the original logging functions. Partly solves Provide curve-drawing examples #161.
  • Docs, tests, etc.
  5. Small fixes in tianshou/policy to make the policies easier to use and add some standard tricks.
  • Take Gym's 'TimeLimit.truncated' flag into consideration to make the policies more efficient (see the second sketch after this list).
  6. Release the MuJoCo benchmark (source code, data, graphs, detailed comparison, analysis of hyperparameters, etc.) for the 3 algorithms.
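
For commit 4, a hypothetical sketch of what a self-defined trainer logger could look like is given below; the class and method names are illustrative only and are not the final Tianshou API.

```python
from abc import ABC, abstractmethod

from torch.utils.tensorboard import SummaryWriter


class BenchmarkLogger(ABC):
    """Hypothetical interface: the trainer would call these hooks instead of
    writing to a SummaryWriter directly, so any logging backend can be used."""

    @abstractmethod
    def log_train_data(self, data: dict, step: int) -> None:
        """Record training statistics at a given env step."""

    @abstractmethod
    def log_test_data(self, data: dict, step: int) -> None:
        """Record evaluation statistics at a given env step."""


class TensorboardLogger(BenchmarkLogger):
    """Default backend reproducing the old writer-based behavior, with train
    logs throttled independently via train_interval."""

    def __init__(self, writer: SummaryWriter, train_interval: int = 1000):
        self.writer = writer
        self.train_interval = train_interval
        self._last_train_step = -train_interval

    def log_train_data(self, data: dict, step: int) -> None:
        if step - self._last_train_step >= self.train_interval:
            for key, value in data.items():
                self.writer.add_scalar(f"train/{key}", value, global_step=step)
            self._last_train_step = step

    def log_test_data(self, data: dict, step: int) -> None:
        for key, value in data.items():
            self.writer.add_scalar(f"test/{key}", value, global_step=step)
```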
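For commit 5, the sketch below illustrates the 'TimeLimit.truncated' flag that Gym's TimeLimit wrapper sets when an episode is cut off by the step limit rather than by a true terminal state; the buffer / TD-target handling shown here is a simplified illustration, not Tianshou's actual implementation.

```python
import gym

# Old (gym 0.17-style) 4-tuple step API, matching the version cited above.
env = gym.make("HalfCheetah-v3")
obs = env.reset()
done = False
while not done:
    act = env.action_space.sample()
    obs_next, rew, done, info = env.step(act)
    # The TimeLimit wrapper sets this flag when the episode ends only because
    # the step limit was reached, not because of a real terminal state.
    truncated = info.get("TimeLimit.truncated", False)
    # A truncated episode should still be bootstrapped in the TD target, so
    # the transition stored in the buffer should not be marked terminal.
    terminal_for_buffer = done and not truncated
    obs = obs_next
```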

Future work

  1. Remove warnings and implementations for originally supported but now unsupported methods.
  2. Add support (benchmarked in the same way) for other algorithms (VPG, PPO, TRPO, etc.).
  3. Speed analysis, and provide a set of hyperparameters that can be trained in parallel using Tianshou to speed up training.
  4. Consider discrete-action environments like Atari (maybe support Rainbow in Tianshou).
  5. A tutorial on how to tune hyperparameters for a specific RL problem.
  6. ......
