
Plans for releasing a MuJoCo benchmark of on-policy algorithms (VPG, A2C, PPO) #307

@ChenDRAG

Description


Purpose

The purpose of this issue (discussion) is to introduce a series of PRs in the near future targeted at releasing Tianshou's benchmark for the MuJoCo Gym task suite using the on-policy algorithms already supported by Tianshou (VPG, A2C, PPO).

Introduction

This issue is closely related to #274, which mainly focuses on benchmarking MuJoCo environments using off-policy algorithms and enhancing Tianshou along the way. MuJoCo is a widely used task suite in the literature, yet most DRL libraries have not released a satisfying benchmark on these important tasks. (While open-source implementations are available to practitioners, the lack of graphs, source data, comparisons, and specific details still makes it hard for starters to compare performance across algorithms.) We therefore decided to try building a complete benchmark for the MuJoCo Gym task suite. The first step is #274, and the second is this one.

The work most closely related to ours is probably Spinning Up (PyTorch), which benchmarks 3 on-policy algorithms (VPG, PPO, TRPO) on 5 MuJoCo environments, while our benchmark will try to support more algorithms on 9 out of 13 environments. (Pusher, Thrower, Striker, and HumanoidStandup are not considered because they are not commonly seen in the literature.) Also, Spinning Up is intended for beginners and thus does not implement standard tricks in DRL (such as observation normalization and normalized value regression targets), whereas we intend to do so.
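To make these tricks concrete, here is a minimal sketch of running-mean/std observation normalization, roughly in the spirit of OpenAI Baselines' `RunningMeanStd`; the class and helper names are illustrative only and not Tianshou's actual API:

```python
import numpy as np


class RunningMeanStd:
    """Track a running mean and variance over observation batches
    (illustrative helper, not Tianshou's actual API)."""

    def __init__(self, shape):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4  # avoid division by zero before the first update

    def update(self, batch: np.ndarray) -> None:
        batch_mean, batch_var = batch.mean(axis=0), batch.var(axis=0)
        batch_count = batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        # Chan et al. parallel mean/variance combination
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta ** 2 * self.count * batch_count / total) / total
        self.count = total


def normalize_obs(rms: RunningMeanStd, obs: np.ndarray, clip: float = 10.0) -> np.ndarray:
    """Standardize observations with the running statistics and clip outliers."""
    return np.clip((obs - rms.mean) / np.sqrt(rms.var + 1e-8), -clip, clip)
```

The same idea applies to normalized value regression targets: keep running statistics of the returns and regress the critic on the standardized targets.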

Beyond that, as with the off-policy benchmark, for each supported algorithm we will try to provide:

  • Default hyperparameters used for the benchmark and scripts to reproduce it.
  • A comparison of performance (and code-level details) with other open-source implementations or classic papers.
  • Graphs and raw data that can be used for research purposes.
  • Log details obtained during training.
  • Pretrained agents.
  • Some hints on how to tune the algorithm.

I have made a plan and hope to finish the tasks above in the coming few weeks. Some features of Tianshou will be enhanced along the way. I have already run some experiments on my fork of Tianshou, which produced the PPO benchmark below.

| Environment | Tianshou | ikostrikov/pytorch-a2c-ppo-acktr-gail | PPO paper | baselines | spinningup (PyTorch) |
| --- | --- | --- | --- | --- | --- |
| Ant | 3253.5+-894.3 | N | N | N | ~650 |
| HalfCheetah | 5460.5+-1319.3 | ~3120 | ~1800 | ~1700 | ~1670 |
| Hopper | 2139.7+-776.6 | ~2300 | ~2330 | ~2400 | ~1850 |
| Walker2d | 3407.3+-1011.3 | ~4000 | ~3460 | ~3510 | ~1230 |
| Swimmer | 99.8+-33.9 | N | ~108 | ~111 | ~120 |
| Humanoid | 597.4+-29.7 | N | N | N | N |
| Reacher | -3.9+-0.4 | ~-5 | ~-7 | ~-6 | N |
| InvertedPendulum | N | N | ~1000 | ~940 | N |
| InvertedDoublePendulum | 8407.6+-1136.0 | N | ~8000 | ~7350 | N |

(This table is outdated by now; check examples/mujoco for the latest SOTA benchmark.)

* Reward metric: the table value is the max average return over 10 trials (different seeds) ± a single standard deviation over trials. Each trial is averaged over another 10 test seeds. Only the first 1M steps of data are considered. The shaded region on the graph also represents a single standard deviation. (Note that in the TD3 paper the shaded region represents only half of that.)

** ~ means the number is approximated from the graph because accurate numbers are not provided in the paper. N means graphs are not provided.

*** We use the latest version of all MuJoCo environments in gym (0.17.3), which is often not the case in other papers; please check the original papers for details. (Outcomes from different versions are usually similar, though.)
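For clarity, one plausible reading of this reward metric can be computed as follows. The sketch assumes `returns` is an array of shape (10 trials, number of evaluation points) whose entries are already averaged over the 10 test seeds; that data layout is an assumption for illustration, not the exact script used here:

```python
import numpy as np


def reward_metric(returns: np.ndarray) -> str:
    """returns: (n_trials, n_evaluations), each entry already averaged over
    the 10 test seeds within the first 1M steps (assumed layout)."""
    mean_curve = returns.mean(axis=0)    # average over the 10 trials
    best = int(mean_curve.argmax())      # evaluation point with the max average return
    return f"{mean_curve[best]:.1f}+-{returns[:, best].std():.1f}"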

Example graph: (figure)

Plans

Here I briefly list the 7 planned PRs:

  1. Add an observation normalization wrapper for the vector environment, because observation normalization affects on-policy algorithms a lot.

  2. Refactor the VPG algorithm:

  • support value normalization and observation normalization;
  • try different versions of the VPG algorithm and benchmark them on MuJoCo.

  3. Probably support natural policy gradient and make a benchmark.

  4. Possibly support learning rate decay (see the sketch after this list).

  5. Refactor the A2C algorithm and make a benchmark.

  6. Refactor and benchmark the PPO algorithm.

  7. Other enhancements:

  • provide drawing tools to reproduce the benchmark, fixing Provide curve-drawing examples #161;
  • more loggers that can be used out of the box;
  • maybe a fine-tuned version of PPO that uses the tricks suggested by this paper.
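For item 4, here is a minimal sketch of how learning rate decay could be wired up with a plain PyTorch scheduler; the network, step budget, and linear schedule below are illustrative placeholders, not the final design:

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

max_update_num = 1000                 # total number of policy updates (assumed budget)
net = nn.Linear(376, 17)              # placeholder; a real actor/critic network goes here
optim = torch.optim.Adam(net.parameters(), lr=3e-4)

# Linearly decay the learning rate from 3e-4 towards 0 over the whole run.
scheduler = LambdaLR(optim, lr_lambda=lambda n: 1.0 - n / max_update_num)

for update in range(max_update_num):
    # ... collect data, compute the loss, and call optim.step() here ...
    scheduler.step()                  # decay the lr once per policy update
```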

Future work

  1. Add support (benchmarked in the same way) for other algorithms (TRPO, ACER, etc.).
