Description
Currently, for example in PPO, actions are sampled from a Gaussian distribution, whose support is (-inf, inf). The action is then remapped to (-1, 1) by clipping.
Do this and you will find that the algorithm does not converge. This is actually a bug: when you clip the action, the log probability stored for the sampled action no longer matches the action that was actually executed, which makes the original update equation incorrect and can severely harm performance in certain circumstances.
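To make the problem concrete, here is a minimal sketch (plain PyTorch, with names chosen for illustration rather than taken from the Tianshou API): once the sample is clipped, the log probability used in the PPO ratio no longer corresponds to the action the environment actually received (and the clipped Gaussian even puts point mass on the boundary, which a plain `log_prob` cannot express).

```python
import torch
from torch.distributions import Normal

# Gaussian policy output for a single action dimension
dist = Normal(loc=torch.tensor(0.9), scale=torch.tensor(0.5))

raw_action = dist.sample()                    # unbounded sample from (-inf, inf)
clipped_action = raw_action.clamp(-1.0, 1.0)  # what the environment actually receives

# Whenever the sample falls outside (-1, 1), these two values disagree, so the
# surrogate objective is evaluated with the wrong probability for the executed action.
print(dist.log_prob(raw_action))      # log-density of the stored / raw action
print(dist.log_prob(clipped_action))  # log-density evaluated at the clipped action
```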
This is where the comment in test_ppo.py comes from:
policy = PPOPolicy(
    actor, critic, optim, dist, args.gamma,
    max_grad_norm=args.max_grad_norm,
    eps_clip=args.eps_clip,
    vf_coef=args.vf_coef,
    ent_coef=args.ent_coef,
    reward_normalization=args.rew_norm,
    # dual_clip=args.dual_clip,
    # dual clip cause monotonically increasing log_std :)
    value_clip=args.value_clip,
    # action_range=[env.action_space.low[0], env.action_space.high[0]],)
    # if clip the action, ppo would not converge :)
    gae_lambda=args.gae_lambda)
(see tianshou/test/continuous/test_ppo.py, line 103 at commit ec23c7e)
One way to fix this is to not store the remapped action in the buffer, and instead treat the remapping as part of the environment (i.e. as a black box); this is what Gym / Stable-Baselines3 actually do. A sketch of this idea follows below.
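Here is a minimal sketch of that idea (the collector loop, the `buffer.add` signature, and the field names are illustrative, not the actual Tianshou code): sample and store the unbounded action, and clip only at the moment the environment is stepped.

```python
import numpy as np

def collect_step(policy_dist, env, buffer, obs):
    """Sample in the unbounded policy space; treat clipping as part of the env."""
    raw_action = policy_dist.sample()            # stays in (-inf, inf)
    log_prob = policy_dist.log_prob(raw_action)  # consistent with raw_action

    # Clipping happens only at the environment boundary (black-box view):
    env_action = np.clip(raw_action.numpy(),
                         env.action_space.low, env.action_space.high)
    obs_next, reward, done, info = env.step(env_action)

    # Only the raw action and its log-prob go into the buffer, so the PPO ratio
    # exp(log pi_new(a|s) - log pi_old(a|s)) is computed for the same action
    # that was actually sampled, not the clipped one.
    buffer.add(obs=obs, act=raw_action, rew=reward, done=done,
               obs_next=obs_next, log_prob=log_prob)
    return obs_next, done
```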
I will open a PR shortly to fix this.