
Plans for releasing a MuJoCo benchmark of on-policy algorithms (VPG, A2C, PPO) #307

@ChenDRAG

Description


Purpose

The purpose of this issue (discussion) is to introduce a series of PRs in the near future targeted at releasing Tianshou's benchmark for the MuJoCo Gym task suite using the on-policy algorithms already supported by Tianshou (VPG, A2C, PPO).

Introduction

This issue is closely related to #274, which mainly focuses on benchmarking MuJoCo environments using off-policy algorithms and enhancing Tianshou along the way. MuJoCo is a widely used task suite in the literature, yet most DRL libraries have not released a satisfying benchmark on these important tasks. (While open-source implementations are available to practitioners, the lack of graphs, source data, comparisons, and specific details still makes it hard for starters to compare performance across algorithms.) We therefore decided to try building a complete benchmark for the MuJoCo Gym task suite. The first step is #274, and the second is this one.

The work most closely related to ours is probably Spinning Up (PyTorch), which benchmarks 3 on-policy algorithms (VPG, PPO, TRPO) on 5 MuJoCo environments, while our benchmark will try to support more algorithms on 9 out of 13 environments. (Pusher, Thrower, Striker, and HumanoidStandup are not considered because they are not commonly seen in the literature.) Also, Spinning Up is intended for beginners and thus does not implement standard tricks in DRL (such as observation normalization and normalized value regression targets), whereas we intend to do so.
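To make these tricks concrete, here is a minimal sketch of running-mean/std observation normalization, roughly in the spirit of OpenAI Baselines' `RunningMeanStd`; the class and helper names are illustrative only and not Tianshou's actual API:

```python
import numpy as np


class RunningMeanStd:
    """Track a running mean and variance over observation batches
    (illustrative helper, not Tianshou's actual API)."""

    def __init__(self, shape):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4  # avoid division by zero before the first update

    def update(self, batch: np.ndarray) -> None:
        batch_mean, batch_var = batch.mean(axis=0), batch.var(axis=0)
        batch_count = batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Chan et al. parallel mean/variance combination
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta ** 2 * self.count * batch_count / total) / total
        self.count = total


def normalize_obs(rms: RunningMeanStd, obs: np.ndarray, clip: float = 10.0) -> np.ndarray:
    """Standardize observations with the running statistics and clip outliers."""
    return np.clip((obs - rms.mean) / np.sqrt(rms.var + 1e-8), -clip, clip)
```

The same idea applies to normalized value regression targets: keep running statistics of the returns and regress the critic on the standardized targets.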

Beyond that, as with the off-policy benchmark, for each supported algorithm we will try to provide:

  • Default hyperparameters used for the benchmark and scripts to reproduce it.
  • A comparison of performance (and code-level details) with other open-source implementations or classic papers.
  • Graphs and raw data that can be used for research purposes.
  • Log details obtained during training.
  • Pretrained agents.
  • Some hints on how to tune the algorithm.

I have made a plan and hope to finish the tasks above in the coming few weeks. Some features of Tianshou will be enhanced along the way. I have already run some experiments on my fork of Tianshou, which produced the PPO benchmark below.

| Environment | Tianshou | ikostrikov/pytorch-a2c-ppo-acktr-gail | PPO paper | baselines | spinningup (PyTorch) |
| --- | --- | --- | --- | --- | --- |
| Ant | 3253.5+-894.3 | N | N | N | ~650 |
| HalfCheetah | 5460.5+-1319.3 | ~3120 | ~1800 | ~1700 | ~1670 |
| Hopper | 2139.7+-776.6 | ~2300 | ~2330 | ~2400 | ~1850 |
| Walker2d | 3407.3+-1011.3 | ~4000 | ~3460 | ~3510 | ~1230 |
| Swimmer | 99.8+-33.9 | N | ~108 | ~111 | ~120 |
| Humanoid | 597.4+-29.7 | N | N | N | N |
| Reacher | -3.9+-0.4 | ~-5 | ~-7 | ~-6 | N |
| InvertedPendulum | N | N | ~1000 | ~940 | N |
| InvertedDoublePendulum | 8407.6+-1136.0 | N | ~8000 | ~7350 | N |

(This table is outdated by now; check examples/mujoco for the latest SOTA benchmark.)

* Reward metric: the table value is the max average return over 10 trials (different seeds) ± a single standard deviation over trials. Each trial is averaged over another 10 test seeds. Only the first 1M steps of data are considered. The shaded region on the graph also represents a single standard deviation. (Note that in the TD3 paper the shaded region represents only half of that.)

** ~ means the number is approximated from the graph because accurate numbers are not provided in the paper. N means graphs are not provided.

*** We use the latest version of all MuJoCo environments in gym (0.17.3), which is often not the case in other papers; please check the original papers for details. (Outcomes from different versions are usually similar, though.)
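For clarity, one plausible reading of this reward metric can be computed as follows. The sketch assumes `returns` is an array of shape (10 trials, number of evaluation points) whose entries are already averaged over the 10 test seeds; that data layout is an assumption for illustration, not the exact script used here:

```python
import numpy as np


def reward_metric(returns: np.ndarray) -> str:
    """returns: (n_trials, n_evaluations), each entry already averaged over
    the 10 test seeds within the first 1M steps (assumed layout)."""
    mean_curve = returns.mean(axis=0)    # average over the 10 trials
    best = int(mean_curve.argmax())      # evaluation point with the max average return
    return f"{mean_curve[best]:.1f}+-{returns[:, best].std():.1f}"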

Example graph: (figure)

Plans

Here I briefly list the 7 planned PRs:

  1. Add an observation normalization wrapper for the vector environment, because observation normalization affects on-policy algorithms a lot.

  2. Refactor the VPG algorithm:

  • support value normalization and observation normalization;
  • try different versions of the VPG algorithm and benchmark them on MuJoCo.

  3. Probably support natural policy gradient and make a benchmark.

  4. Possibly support learning rate decay (see the sketch after this list).

  5. Refactor the A2C algorithm and make a benchmark.

  6. Refactor and benchmark the PPO algorithm.

  7. Other enhancements:

  • provide drawing tools to reproduce the benchmark, fixing Provide curve-drawing examples #161;
  • more loggers that can be used out of the box;
  • maybe a fine-tuned version of PPO that uses the tricks suggested by this paper.
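For item 4, here is a minimal sketch of how learning rate decay could be wired up with a plain PyTorch scheduler; the network, step budget, and linear schedule below are illustrative placeholders, not the final design:

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

max_update_num = 1000                 # total number of policy updates (assumed budget)
net = nn.Linear(376, 17)              # placeholder; a real actor/critic network goes here
optim = torch.optim.Adam(net.parameters(), lr=3e-4)

# Linearly decay the learning rate from 3e-4 towards 0 over the whole run.
scheduler = LambdaLR(optim, lr_lambda=lambda n: 1.0 - n / max_update_num)

for update in range(max_update_num):
    # ... collect data, compute the loss, and call optim.step() here ...
    scheduler.step()                  # decay the lr once per policy update
```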

Future work

  1. Add support (benchmarked in the same way) for other algorithms (TRPO, ACER, etc.).
